[57] ABSTRACT

A high performance CPU of the RISC (reduced instruction set) type
employs a standardized, fixed instruction size, and permits only
simplified memory access data width and addressing modes. The
instruction set is limited to register-to-register operations and
register load/store operations. Byte manipulation instructions,
included to permit use of previously-established data structures,
include the facility for doing in-register byte extract, insert and
masking, along with non-aligned load and store instructions. The
provision of load/locked and store/conditional instructions permits
the implementation of atomic byte writes.

24 Claims, 6 Drawing Sheets 
ENSURING DATA INTEGRITY BY LOCKED-LOAD AND CONDITIONAL-STORE
OPERATIONS IN A MULTIPROCESSOR SYSTEM

RELATED CASES

This application discloses subject matter also disclosed in the
following copending applications, filed herewith and assigned to
Digital Equipment Corporation, the assignee of this invention:

Ser. No. 547,589, filed Jun. 29, 1990, entitled BRANCH PREDICTION IN
HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and Richard T. Witek,
inventors;

Ser. No. 547,630, filed Jun. 29, 1990, entitled IMPROVING PERFORMANCE
IN REDUCED INSTRUCTION SET PROCESSOR, by Richard L. Sites and Richard
T. Witek, inventors;

Ser. No. 547,629, filed Jun. 29, 1990, entitled IMPROVING BRANCH
PERFORMANCE IN HIGH SPEED PROCESSOR, by Richard L. Sites and Richard
T. Witek, inventors;

Ser. No. [illegible], entitled GRANULARITY HINT FOR TRANSLATION BUFFER
IN HIGH PERFORMANCE PROCESSOR, by Richard L. Sites and Richard T.
Witek, inventors;

Ser. No. [illegible], entitled [illegible] REDUCED INSTRUCTION SET
PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors;

Ser. No. 07/5476[illegible], filed Jun. 29, 1990, entitled IMPROVING
COMPUTER PERFORMANCE BY ELIMINATING BRANCHES, by Richard L. Sites and
Richard T. Witek, inventors; and

Ser. No. 07/547,597, filed Jun. 29, 1990, entitled BYTE-COMPARE
OPERATION FOR HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and
Richard T. Witek, inventors.

BACKGROUND OF THE INVENTION

This invention relates to digital computers, and more particularly to
a high-performance processor executing a reduced instruction set.

Complex instruction set or CISC processors are characterized by having
a large number of instructions in their instruction set, often
including memory-to-memory instructions with complex memory accessing
modes. The instructions are usually of variable length, with simple
instructions being only perhaps one byte in length, but the length
ranging up to dozens of bytes. The VAX™ instruction set is a primary
example of CISC and employs instructions having one to two byte
opcodes plus from zero to six operand specifiers, where each operand
specifier is from one byte to many bytes in length. The size of the
operand specifier depends upon the addressing mode, size of
displacement (byte, word or longword), etc. The first byte of the
operand specifier describes the addressing mode for that operand,
while the opcode defines the number of operands: one, two or three.
When the opcode itself is decoded, however, the total length of the
instruction is not yet known to the processor because the operand
specifiers have not yet been decoded. Another characteristic of
processors of the VAX type is the use of byte or byte string memory
references, in addition to quadword or longword references; that is, a
memory reference may be of a length variable from one byte to multiple
words, including unaligned byte references.

Reduced instruction set or RISC processors are characterized by a
smaller number of instructions which are simple to decode, and by
requiring that all arithmetic/logic operations be performed
register-to-register. Another feature is that of allowing no complex
memory accesses; all memory accesses are register load/store
operations, and there are a small number of relatively simple
addressing modes, i.e., only a few ways of specifying operand
addresses. Instructions are of only one length, and memory accesses
are of a standard data width, usually aligned. Instruction execution
is of the direct hardwired type, as distinct from microcoding. There
is a fixed instruction cycle time, and the instructions are defined to
be relatively simple so that they all execute in one short cycle (on
average, since pipelining will spread the actual execution over
several cycles).

One advantage of CISC processors is in writing source code. The
variety of powerful instructions, memory accessing modes and data
types should result in more work being done for each line of code
(actually, compilers do not produce code taking full advantage of
this); but whatever gain in compactness of source code is accomplished
at the expense of execution time. Particularly as pipelining of
instruction execution has become necessary to achieve performance
levels demanded of systems presently, the data or state dependencies
of successive instructions, and the vast differences in memory access
time vs. machine cycle time, produce excessive stalls and exceptions,
slowing execution. The advantage of RISC processors is the speed of
execution of code, but the disadvantage is that less is accomplished
by each line of code, and the code to accomplish a given task is much
more lengthy. One line of VAX code can accomplish the same as many
lines of RISC code.

When CPUs were much faster than memory, it was advantageous to do more
work per instruction, because otherwise the CPU would always be
waiting for the memory to deliver instructions; this factor led to
more complex instructions that encapsulated what would be otherwise
implemented as subroutines. When CPU and memory speed became more
balanced, a simple approach such as that of the RISC concepts becomes
more feasible, assuming the memory system is able to deliver one
instruction and some data in each cycle. Hierarchical memory
techniques, as well as faster access cycles, provide these faster
memory speeds. Another factor that has influenced the CISC vs. RISC
choice is the change in relative cost of off-chip vs. on-chip
interconnection resulting from VLSI construction of CPUs. Construction
on chips instead of boards changes the economics: first it pays to
make the architecture simple enough to be on one chip, then more
on-chip memory is possible (and needed) to avoid going off-chip for
memory references. A further factor in the comparison is that adding
more complex instructions and addressing modes as in a CISC solution
complicates (thus slows down) stages of the instruction execution
process. The complex function might make the function execute faster
than an equivalent sequence of simple instructions, but it can
lengthen the instruction cycle time, making all instructions execute
slower; thus an added function must increase the overall performance
enough to compensate for the decrease in the instruction execution
rate.
The performance advantages of RISC processors, taking into account
these and other factors, are considered to outweigh the shortcomings,
and, were it not for the existing software base, most new processors
would probably be designed using RISC features. A problem is that
business enterprises have invested many years of operating background,
including operator training as well as the cost of the code itself, in
applications programs and data structures using the CISC type
processors which were the most widely used in the past ten or fifteen
years. The expense and disruption of operations to rewrite all of the
code and data structures to accommodate a new processor architecture
may not be justified, even though the performance advantages
ultimately expected to be achieved would be substantial.

Accordingly, the objective is to accomplish all of the performance
advantages of a RISC-type processor architecture, but yet allow the
data structures and code previously generated for existing CISC-type
processors to be translated for use in a high-performance processor.

SUMMARY OF THE INVENTION

In accordance with one embodiment of the invention, a high-performance
processor is provided which is of the RISC type, using a standardized,
fixed instruction size, and permitting only a simplified memory access
data width, using simple addressing modes. The instruction set is
limited to register-to-register operations (for arithmetic and logic
type operations using the ALU, etc.) and register load/store
operations where memory is referenced; there are no memory-to-memory
operations, nor register-to-memory operations in which the ALU or
other logic functions are done. The functions performed by
instructions are limited to allow non-microcoded implementation,
simple to decode and execute in a short cycle. On-chip floating point
processing is provided, and on-chip instruction and data caches are
employed in an example embodiment.

Byte manipulation instructions are included to permit use of
previously-established data structures. These instructions include the
facility for doing in-register byte extract, insert and masking, along
with non-aligned load and store instructions, so that byte addresses
can be made use of even though the actual memory operations are
aligned quadword in nature.

The provision of load/locked and store/conditional instructions
permits the implementation of atomic byte writes. To write to a byte
address in a multibyte (e.g., quadword) aligned memory, the CPU loads
a quadword (or longword) and locks this location, writes to the byte
address in register while leaving the remainder of the quadword
undisturbed, then stores the updated quadword in memory conditionally,
depending upon whether the quadword has been written by another
processor since the load/locked operation.
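
The load/locked, modify, store/conditional sequence just described can be illustrated with a minimal Python sketch. It is a software model under stated assumptions, not the hardware of the invention: the `Memory` class, its per-CPU lock table, the retry loop, and the little-endian byte numbering (byte 0 in the low-order bits) are all hypothetical choices made for illustration.

```python
MASK64 = (1 << 64) - 1


class Memory:
    """Hypothetical quadword-aligned memory with one lock per CPU."""

    def __init__(self):
        self.quads = {}   # aligned address -> 64-bit value
        self.locks = {}   # cpu id -> aligned address currently locked

    def load_locked(self, cpu, addr):
        aligned = addr & ~7
        self.locks[cpu] = aligned            # remember what this CPU is watching
        return self.quads.get(aligned, 0)

    def store_conditional(self, cpu, addr, value):
        aligned = addr & ~7
        if self.locks.get(cpu) != aligned:   # lock lost: someone wrote first
            return False
        self.quads[aligned] = value & MASK64
        # A successful write breaks every lock held on this quadword.
        for holder in [c for c, a in self.locks.items() if a == aligned]:
            del self.locks[holder]
        return True


def atomic_byte_write(mem, cpu, byte_addr, byte):
    """Write one byte using only quadword loads and stores; retry on failure."""
    shift = 8 * (byte_addr & 7)
    while True:
        quad = mem.load_locked(cpu, byte_addr)
        # Insert the byte in the register copy, leaving the other bytes alone.
        quad = (quad & ~(0xFF << shift) & MASK64) | ((byte & 0xFF) << shift)
        if mem.store_conditional(cpu, byte_addr, quad):
            return quad
```

In this model a conditional store fails whenever another CPU has completed a store to the same quadword after the load/locked, so a stale register copy is never written back over a neighbouring byte updated by that other CPU.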

Another byte manipulation instruction, according to one feature of the
invention, is a byte compare instruction. All bytes of a quadword in a
register are compared to corresponding bytes in another register. The
result is a single byte (one bit for each byte compared) in a third
register. Since this operation is done to a general purpose register
(rather than to a special hardware location), several of the byte
compares can be done in sequence, and no added state must be accounted
for upon interrupt or the like. This byte compare can be used to
advantage with a byte zeroing instruction in which selected bytes of a
quadword are zeroed, with the bytes being selected by bits in a
low-order byte of a register. That is, the result of a byte compare
can be used to zero bytes of another register.
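
As an illustration of how a byte compare result can drive byte zeroing, the following Python sketch models both operations on 64-bit integers. The text above does not state which relation the compare tests; the unsigned "greater than or equal" relation used here is an assumption, as are the function names and the little-endian byte numbering.

```python
MASK64 = (1 << 64) - 1


def bytes_of(quad):
    """Split a 64-bit value into eight bytes, byte 0 being the low-order byte."""
    return [(quad >> (8 * i)) & 0xFF for i in range(8)]


def byte_compare_ge(a, b):
    """Return an 8-bit mask; bit i is set when byte i of a >= byte i of b."""
    mask = 0
    for i, (x, y) in enumerate(zip(bytes_of(a), bytes_of(b))):
        if x >= y:
            mask |= 1 << i
    return mask


def zero_bytes(quad, mask):
    """Zero every byte of quad whose bit is set in the low-order byte of mask."""
    for i in range(8):
        if mask & (1 << i):
            quad &= ~(0xFF << (8 * i))
    return quad & MASK64


# Comparing zero against a quadword flags exactly the bytes that are zero,
# which is one way to locate a string terminator eight bytes at a time.
text = 0x0000000000636261          # "abc" followed by zero bytes
terminators = byte_compare_ge(0, text)
```

Because the mask lands in an ordinary register, it can be passed straight to `zero_bytes` or combined with other masks without any special state.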

Speed of execution is highly dependent on the sequentiality of the
instruction stream; branches disrupt the sequence and generate stalls
while the prefetched instruction stream is flushed and a new sequence
is begun. By providing a conditional move instruction, many short
branches can be eliminated altogether. A conditional move instruction
tests a register and moves a second register to a third if the
condition is met; this function can be substituted for short branches
and thus maintain the sequentiality of the instruction stream.
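
A short Python sketch may make the substitution concrete. Each helper stands in for one register-to-register instruction; the names `cmp_lt` and `cmov_ne` and the choice of a "not equal to zero" condition are assumptions for illustration only.

```python
def cmp_lt(a, b):
    """Compare instruction: write 1 to the result register if a < b, else 0."""
    return 1 if a < b else 0


def cmov_ne(test, src, dest):
    """Conditional move: the new dest is src when test != 0, else dest is kept."""
    return src if test != 0 else dest


def max_without_branch(a, b):
    """Compute max(a, b) as a straight-line sequence with no taken branch."""
    t = cmp_lt(a, b)                  # t = (a < b)
    result = a                        # unconditional move
    result = cmov_ne(t, b, result)    # overwrite with b only when a < b
    return result
```

The three steps always execute in the same order regardless of the data, so the instruction stream stays sequential where an if/else would have needed a short forward branch.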

If branches cannot be avoided, performance can be improved by
predicting the target of a branch and prefetching the new instruction
based upon this prediction. According to a feature of one embodiment,
a branch prediction rule is followed that requires all forward
branches to be predicted not-taken and all backward branches (as is
common for loops) to be predicted as taken. Upon compilation, the code
is rearranged to make sure the most likely path is backward rather
than forward, so more often than not the predicted path is taken and
the proper instruction is prefetched.

Another performance improvement is to make use of unused bits in the
standard-sized instruction to provide a hint of the expected target
address for jump and jump to subroutine instructions or the like. The
target can thus be prefetched before the actual address has been
calculated and placed in a register. If the target address of the hint
matches the calculated address when the instruction is executed, then
the prefetched address is already in the pipeline and will execute
much faster. The hint is added to the jump instruction by the
compiler.

In addition, the unused displacement part of the jump instruction can
contain a field to define the actual type of jump, i.e., jump, jump to
subroutine, return from subroutine, and thus place a predicted target
address in a stack to allow prefetching before the instruction has
been executed, or take other action appropriate to the operation
defined by the hint. A hint may be ignored by the hardware, and if so
the code still executes properly, just slower.

According to a feature of one embodiment, the processor employs a
variable memory page size, so that the entries in a translation buffer
for implementing virtual addressing can be optimally used. A
granularity hint is added to the page table entry to define the page
size for this entry. If a large number of sequential pages share the
same protection and access rights, all of these pages can be
referenced with the same page table entry, and so the use of the
translation buffer becomes more efficient. The likelihood of a hit in
the translation buffer is increased, so the number of faults to access
the page tables is minimized.
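
The effect of a granularity hint on translation buffer lookups can be sketched in Python as follows. This is a hedged illustration only: the `TBEntry` layout, the `translate` function, the 8 Kbyte base page, and in particular the block sizes of 1, 8, 64 and 512 pages in `HINT_PAGES` are assumptions chosen for the example, not values stated in this summary.

```python
from dataclasses import dataclass

PAGE_SHIFT = 13                   # assumed 8 Kbyte base page
HINT_PAGES = (1, 8, 64, 512)      # assumed pages covered per hint value


@dataclass
class TBEntry:
    vpn: int     # virtual page number of the first page in the block
    pfn: int     # page frame number of the first page in the block
    hint: int    # granularity hint, an index into HINT_PAGES


def translate(entries, vaddr):
    """Return the physical address for vaddr, or None on a TB miss."""
    for entry in entries:
        pages = HINT_PAGES[entry.hint]
        # A larger block means more low-order bits pass through untranslated.
        block_bits = PAGE_SHIFT + pages.bit_length() - 1
        block_va = (entry.vpn << PAGE_SHIFT) >> block_bits
        if (vaddr >> block_bits) == block_va:
            offset = vaddr & ((1 << block_bits) - 1)
            block_pa = ((entry.pfn << PAGE_SHIFT) >> block_bits) << block_bits
            return block_pa | offset
    return None
```

With a hint covering 64 pages, a single entry for virtual page 64 also translates page 69; with no hint, the same access misses and would require another page table entry to be loaded.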

An additional feature is the addition of a prefetch instruction which
serves to move a block of data to a faster-access cache in the memory
hierarchy before the data block is to be used. This prefetch
instruction would be inserted by the compiler to perform a function
similar to that of a vector processor, but does not require vector
hardware. The prefetch instruction does not generate memory exceptions
or protection or access violations, and so does not slow down
execution if the prefetch fails. Again, the instruction is optional,
and if the processor cannot execute it the normal code executes
without problems.
BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set
forth in the appended claims. The invention itself, however, as well
as other features and advantages thereof, will be best understood by
reference to the detailed description of specific embodiments which
follows, when read in conjunction with the accompanying drawings,
wherein:

FIG. 1 is an electrical diagram in block form of a computer system
employing a CPU which may employ features of the invention;

FIG. 2 is a diagram of data types used in the processor of FIG. 1;

FIG. 3 is an electrical diagram in block form of the instruction unit
or I-box of the CPU of FIG. 1;

FIG. 4 is an electrical diagram in block form of the integer execution
unit or E-box in the CPU of FIG. 1;

FIG. 5 is an electrical diagram in block form of the addressing unit
or A-box in the CPU of FIG. 1;

FIG. 6 is an electrical diagram in block form of the floating point
execution unit or F-box in the CPU of FIG. 1;

FIG. 7 is a timing diagram of the pipelining in the CPU of FIGS. 1-6;

FIG. 8 is a diagram of the instruction formats used in the instruction
set of the CPU of FIGS. 1-6;

FIG. 9 is a diagram of the format of a virtual address used in the CPU
of FIGS. 1-6;

FIG. 10 is a diagram of the format of a page table entry used in the
CPU of FIGS. 1-6; and

FIG. 11 is a diagram of the addressing translation mechanism used in
the CPU of FIGS. 1-6.



DETAILED DESCRIPTION OF SPECIFIC EMBODIMENT



Referring to FIG. 1, a computer system which may use features of the
invention, according to one embodiment, includes a CPU 10 connected by
a system bus 11 to a main memory 12, with an I/O unit (not shown) also
accessed via the system bus. The system may be of various levels, from
a stand-alone workstation up to a mid-range multiprocessor, in which
case the other CPUs such as a CPU 15 also access the main memory 12
via the system bus 11.

The CPU 10 is preferably a single-chip integrated circuit device,
although features of the invention could be employed in a processor
constructed in multi-chip form. Within the single chip an integer
execution unit 16 (referred to as the "E-box") is included, along with
a floating point execution unit 17 (referred to as the "F-box").
Instruction fetch and decoding is performed in an instruction unit 18
or "I-box", and an address unit or "A-box" 19 performs the functions
of address generation, memory management, write buffering and bus
interface. The memory is hierarchical, with on-chip instruction and
data caches being included in the instruction unit 18 and address unit
19 in one embodiment, while a larger, second-level cache 20 is
provided off-chip, being controlled by a cache controller in the
address unit 19.

The CPU 10 employs an instruction set as described below in which all
instructions are of a fixed size, in this case 32-bit or one longword.
The instruction and data types employed are for byte, word, longword
and quadword, as illustrated in FIG. 2. As used herein, a byte is
8-bits, a word is 16-bits or two bytes, a longword is 32-bits or four
bytes, and a quadword is 64-bits or eight bytes. The data paths and
registers within the CPU 10 are generally 64-bit or quadword size, and
the memory 12 and caches use the quadword as the basic unit of
transfer. Performance is enhanced by allowing only quadword or
longword loads and stores, although, in order to be compatible with
data types used in prior software development, byte manipulation is
allowed by certain unique instructions, still maintaining the feature
of only quadword or longword loads and stores.
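
The following Python sketch shows, under stated assumptions, how a byte or longword can be read while the memory system is only ever asked for an aligned quadword. The `SIZES` table restates the data type widths given above; the function names, the use of a `bytes` object as memory, and the little-endian byte order are illustrative assumptions.

```python
# Data type widths in bytes, as defined in the text.
SIZES = {"byte": 1, "word": 2, "longword": 4, "quadword": 8}


def load_quad(memory, addr):
    """Aligned quadword load: the low three address bits are ignored."""
    base = addr & ~7
    return int.from_bytes(memory[base:base + 8], "little")


def extract(quad, addr, size):
    """In-register extract of `size` bytes starting at byte offset addr & 7.

    Only valid when the field does not cross the quadword boundary.
    """
    shift = 8 * (addr & 7)
    return (quad >> shift) & ((1 << (8 * size)) - 1)


def load_byte(memory, addr):
    """Byte read built from a quadword load followed by a register extract."""
    return extract(load_quad(memory, addr), addr, SIZES["byte"])
```

For example, with a memory whose byte at address n holds the value n, `load_byte(memory, 13)` fetches the quadword at address 8 and shifts out byte 5 of that quadword.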

Referring to FIG. 3, the instruction unit 18 or I-box is shown in more
detail. The primary function of the instruction unit 18 is to issue
instructions to the E-box 16, A-box 19 and F-box 17. The instruction
unit 18 includes an instruction cache 21 which stores perhaps 8 Kbytes
of instruction stream data, and a quadword (two instructions) of this
instruction stream data is loaded to an instruction register 22 in
each cycle where the pipeline advances. The instruction unit 18, in a
preferred embodiment, decodes two instructions in parallel in decoders
23 and 24, then checks that the required resources are available for
both instructions by check circuitry 25. If resources are available
and dual issue is possible then both instructions may be issued by
applying register addresses on busses 26 and 27 and control bits on
microcontrol busses 28 and 29 to the appropriate elements in the CPU
10. If the resources are available for only the first instruction or
the instructions cannot be dual issued then the instruction unit 18
issues only the first instruction from the decoder 23. The instruction
unit 18 does not issue instructions out of order, even if the
resources are available for the second instruction (from decoder 24)
and not for the first instruction. The instruction unit 18 does not
issue instructions until the resources for the first instruction
become available. If only the first of a pair of instructions issues
(from the decoder 23), the instruction unit 18 does not advance
another instruction into the instruction register 22 to attempt to
dual issue again. Dual issue is only attempted on aligned quadword
pairs as fetched from memory (or instruction cache 21) and loaded to
instruction register 22 as an aligned quadword.
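
The in-order issue rules for one aligned pair can be summarized in a small Python model. It is a sketch under simplifying assumptions: resource availability is reduced to a single "ready cycle" per instruction, `can_pair` stands for whatever the check circuitry decides about dual issue, and the function name is hypothetical.

```python
def issue_pair(first_ready_cycle, second_ready_cycle, can_pair):
    """Return (cycle_first_issued, cycle_second_issued) for one aligned pair.

    The second instruction never issues ahead of the first, and no new pair
    is considered until both instructions of this pair have issued.
    """
    cycle = 0
    # Stall until the first instruction has its resources, even if the
    # second instruction could already have gone.
    while cycle < first_ready_cycle:
        cycle += 1
    first_issued = cycle

    # Dual issue: both ready in the same cycle and allowed to pair.
    if can_pair and second_ready_cycle <= cycle:
        return first_issued, cycle

    # Otherwise the second instruction issues alone in a later cycle.
    cycle += 1
    while cycle < second_ready_cycle:
        cycle += 1
    return first_issued, cycle
```

The model makes the cost of a non-pairable sequence visible: a pair that cannot dual issue always occupies the instruction register for at least two cycles.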

The instruction unit 18 contains a branch prediction circuit 30
responsive to the instructions in the instruction stream to be loaded
into register 22. The prediction circuit 30 along with a subroutine
return stack 31 is used to predict branch addresses and to cause
address generating circuitry 32 to prefetch the instruction stream
before needed. The subroutine return stack 31 (having four entries,
for example) is controlled by the hint bits in the jump, jump to
subroutine and return instructions as will be described. The virtual
PC (program counter) 33 is included in the address generation
circuitry 32 to produce addresses for instruction stream data in the
selected order.
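
A four-entry subroutine return stack of this kind can be modeled in a few lines of Python. The class and method names are hypothetical, and the overflow policy (silently discarding the oldest entry) is an assumption; the text specifies only the example depth of four and that hint bits control the stack.

```python
from collections import deque


class ReturnStack:
    """Hypothetical model of a small return-address prediction stack."""

    def __init__(self, depth=4):
        # A bounded deque drops its oldest entry when a fifth is pushed.
        self.entries = deque(maxlen=depth)

    def on_jump_to_subroutine(self, return_addr):
        """Hint says 'jump to subroutine': remember where the call returns."""
        self.entries.append(return_addr)

    def predict_return(self):
        """Hint says 'return': pop the predicted target, or None if empty."""
        return self.entries.pop() if self.entries else None
```

A `None` prediction simply means no prefetch is attempted; the return still executes correctly once the actual address is available in a register.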

One branch prediction method is the use of the value 
of the sign bit of the branch displacement to predict 
conditional branches, so the circuit 30 is responsive to 
the sign bit of the displacement appearing in the branch 
instructions appearing at inputs 35. If the sign bit is 
negative, it predicts the branch is taken, and addressing 
circuit 32 adds the displacement to register Ra (one of 
the registers of register set 43, as selected by field Ra of 
the instruction) to produce the first address of the new 
address sequence to be fetched. If the sign is positive it 
predicts not taken, and the present instruction stream is 
continued in sequence. 
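
The sign-bit rule amounts to a one-line test, sketched below in Python. The 21-bit displacement width is an assumption made for the example (the field width is not stated at this point in the text), and the helper names are hypothetical.

```python
def sign_extend(field, width=21):
    """Interpret a `width`-bit field as a two's-complement displacement."""
    sign = 1 << (width - 1)
    return (field & (sign - 1)) - (field & sign)


def predict_taken(displacement_field, width=21):
    """Predict taken for a backward branch, i.e. when the sign bit is set."""
    return (displacement_field >> (width - 1)) & 1 == 1
```

A loop-closing branch with displacement -4 is encoded with the sign bit set and is therefore predicted taken, while a forward skip with displacement +4 is predicted not taken.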

The instruction unit 18 contains an 8-entry fully associative
translation buffer (TB) 36 to cache recently used instruction-stream
address translations and protection information for 8 Kbyte pages.
Although 64-bit addresses are nominally possible, as a practical
matter 43-bit addresses are adequate for the present. Every cycle the
43-bit virtual program counter 33 is presented to the instruction
stream TB 36. If the page table entry (PTE) associated with the
virtual PC is cached in the TB 36 then the page frame number (PFN) and
protection bits for the page which contains the virtual PC are used by
the instruction unit 18 to complete the address translation and access
checks. A physical address is thus applied to the address input 37 of
the instruction cache 21, or if there is a cache miss then this
instruction stream physical address is applied by the bus 38 through
the address unit 19 to the cache 20 or memory 12. In a preferred
embodiment, the instruction stream TB 36 supports any of the four
granularity hint block sizes as defined below, so that the probability
of a hit in the TB 36 is increased.

The execution unit or E-box 16 is shown in more detail in FIG. 4. The
execution unit 16 contains the 64-bit integer execution datapath
including an arithmetic/logic unit (ALU) 40, a barrel shifter 41 and
an integer multiplier 42. The execution unit 16 also contains the
32-register 64-bit wide register file 43, containing registers R0 to
R31, although R31 is hardwired as all zeros. The register file 43 has
four read ports and two write ports which allow the sourcing (sinking)
of operands (results) to both the integer execution datapath and the
address unit 19. A bus structure 44 connects two of the read ports of
the register file 43 to the selected inputs of the ALU 40, the shifter
41 or the multiplier 42 as specified by the control bits of the
decoded instruction on busses 28 or 29 from the instruction unit 18,
and connects the output of the appropriate function to one of the
write ports to store the result. That is, the address fields from the
instruction are applied by the busses 26 or 27 to select the registers
to be used in executing the instruction, and the control bits 28 or 29
define the operation in the ALU, etc., and define which internal
busses of the bus structure 44 are to be used when, etc.

The A-box or address unit 19 is shown in more detail in FIG. 5. The
A-box 19 includes five functions: address translation using a
translation buffer 48, a load silo 49 for incoming data, a write
buffer 50 for outgoing write data, an interface 51 to a data cache,
and the external interface 52 to the bus 11. The address translation
datapath has the displacement adder 53 which generates the effective
address (by accessing the register file 43 via the second set of read
and write ports, and the PC), the data TB 48 which generates the
physical address on address bus 54, and muxes and bypassers needed for
the pipelining.

The 32-entry fully associative data translation buffer 48 caches
recently-used data-stream page table entries for 8 Kbyte pages. Each
entry supports any of the four granularity hint block sizes, and a
detector 55 is responsive to the granularity hint as described below
to change the number of low-order bits of the virtual address passed
through from virtual address bus 56 to the physical address bus 54.

For load and store instructions, the effective 43-bit virtual address
is presented to TB 48 via bus 56. If the PTE of the supplied virtual
address is cached in the TB 48, the PFN and protection bits for the
page which contains the address are used by the address unit 19 to
complete the address translation and access checks.

The write buffer 50 has two purposes: (1) To minimize the number of
CPU stall cycles by providing a high bandwidth (but finite) resource
for receiving store data. This is required since the CPU 10 can
generate store data at the peak rate of one quadword every CPU cycle
which may be greater than the rate at which the external cache 20 can
accept the data; and (2) To attempt to aggregate store data into
aligned 32-byte cache blocks for the purpose of maximizing the rate at
which data may be written from the CPU 10 into the external cache 20.
The write buffer 50 has eight entries. A write buffer entry is invalid
if it does not contain data to be written or is valid if it contains
data to be written. The write buffer 50 contains two pointers: the
head pointer 57 and the tail pointer 58. The head pointer 57 points to
the valid write buffer entry which has been valid the longest period
of time. The tail pointer 58 points to the valid buffer entry slot
which will next be validated. If the write buffer 50 is completely
full (empty) the head and tail pointers point to the same valid
(invalid) entry. Each time the write buffer 50 is presented with a new
store instruction the physical address generated by the instruction is
compared to the address in each valid write buffer entry. If the
address is in the same aligned 32-byte block as an address in a valid
write buffer entry then the store data is merged into that entry and
the entry's longword mask bits are updated. If no matching address is
found in the write buffer then the store data is written into the
entry designated by the tail pointer 58, the entry is validated, and
the tail pointer 58 is incremented to the next entry.

The address unit 19 contains a fully folded memory reference pipeline
which may accept a new load or store instruction every cycle until a
fill of a data cache 59 ("D-cache") is required. Since the data cache
59 lines are only allocated on load misses, the address unit 19 may
accept a new instruction every cycle until a load miss occurs. When a
load miss occurs the instruction unit 18 stops issuing all
instructions that use the load port of the register file 43 (load,
store, jump subroutine, etc., instructions).

Since the result of each data cache 59 lookup is known late in the
pipeline (stage S7 as will be described) and instructions are issued
in pipe stage S3, there may be two instructions in the address unit 19
pipeline behind a load instruction which misses the data cache 59.
These two instructions are handled as follows: First, loads which hit
the data cache 59 are allowed to complete, hit under miss. Second,
load misses are placed in the silo 49 and replayed in order after the
first load miss completes. Third, store instructions are presented to
the data cache 59 at their normal time with respect to the pipeline.
They are silo'ed and presented to the write buffer 50 in order with
respect to load misses.

The on-chip pipelined floating point unit 17 or F-box as shown in more
detail in FIG. 6 is capable of executing both DEC and IEEE floating
point instructions according to the instruction set to be described.
The floating point unit 17 contains a 32-entry, 64-bit, floating point
register file 61, and a floating point arithmetic and logic unit 62.
Divides and multiplies are performed in a multiply/divide circuit 63.
A bus structure 64 interconnects two read ports of the register file
61 to the appropriate functional circuit as directed by the control
bits of the decoded instruction on busses 28 or 29 from the
instruction unit 18. The registers selected for an operation are
defined by the output buses 26 or 27 from the instruction decode. The
floating point unit 17 can accept an instruction every cycle, with the
exception of floating point divide instructions, which can be accepted
only every several cycles. A latency of more than one cycle is
exhibited for all floating point instructions.

In an example embodiment, the CPU 10 has an 8 Kbyte data cache 59, and
8 Kbyte instruction cache 21, with the size of the caches depending on
the available chip area. The on-chip data cache 59 is a write-through,
direct mapped, read-allocate physical cache and has 32-byte
(1-hexaword) blocks. The system may keep the cache 59 coherent with
memory 12 by using an invalidate bus, not shown. The cache 59 has
longword parity in the data array 66 and there is a parity bit for
each tag entry in tag store 67.

The instruction cache 21 may be 8 Kbytes or 16 Kbytes, for example, or
may be larger or smaller, depending upon die area. Although described
above as using physical addressing with a TB 36, it may also be a
virtual cache, in which case it will contain no provision for
maintaining its coherence with memory 12. If the cache 21 is a
physical addressed cache the chip will contain circuitry for
maintaining its coherence with memory: (1) when the write buffer 50
entries are sent to the external interface 52, the address will be
compared against a duplicate instruction cache 21 tag, and the
corresponding block of instruction cache 21 will be conditionally
invalidated; (2) the invalidate bus will be connected to the
instruction cache 21.

The main data paths and registers in the CPU 10 are all 64-bits wide.
That is, each of the integer registers 43, as well as each of the
floating point registers 61, is a 64-bit register, and the ALU 40 has
two 64-bit inputs 40a and 40b and a 64-bit output 40c. The bus
structure 44 in the execution unit 16, which actually consists of more
than one bus, has 64-bit wide data paths for transferring operands
between the integer registers 43 and the inputs and output of the ALU
40. The instruction decoders 23 and 24 produce register address
outputs 26 and 27 which are applied to the addressing circuits of the
integer registers 43 and/or floating point registers 61 to select
which register operands are used as inputs to the ALU 40 or 62, and
which of the registers 43 or registers 61 is the destination for the
ALU (or other functional unit) output.

The dual issue decision is made by the circuitry 25 according to the
following requirement, where only one instruction from the first
column and one instruction from the second column can be issued in one
cycle:

[table and intervening text illegible]

(pipelined as described below, and assuming no stalls), rather than a
variable number of cycles. The instruction set includes only
register-to-register arithmetic/logic type of operations, or
register-to-memory (or memory-to-register) load/store type of
operations, and there are no complex memory addressing modes such as
indirect, etc. An instruction performing an operation in the ALU 40
always gets its operands from the register file 43 (or from a field of
the instruction itself) and always writes the result to the register
file 43; these operands are never obtained from memory and the result
is never written to memory. Loads from memory are always to a register
in register files 43 or 61, and stores to memory are always from a
register in the register files.

Referring to FIG. 7, the CPU 10 has a seven stage pipeline for integer
operate and memory reference instructions. The instruction unit 18 has
a seven stage pipeline to determine instruction cache 21 hit/miss.
FIG. 7 is a pipeline diagram for the pipeline of execution unit 16,
instruction unit 18 and address unit 19. The floating point unit 17
defines a pipeline in parallel with that of the execution unit 16, but
ordinarily employs more stages to execute. The seven stages are
referred to as S0-S6, where a stage is to be executed in one machine
cycle (clock cycle). The first four stages S0, S1, S2 and S3 are
executed in the instruction unit 18, and the last three stages S4, S5
and S6 are executed in one or the other of the execution unit 16 or
address unit 19, depending upon whether the instruction is an operate
or a load/store. There are bypassers in all of the boxes that allow
the results of one instruction to be used as operands of a following
instruction without having to be written to the register file 43 or
61.

The first stage S0 of the pipeline is the instruction fetch or IF
stage, during which the instruction unit 18 fetches two new
instructions from the instruction cache 21, using the PC 33 address as
a base. The second stage S1 is the swap stage, during which the two
fetched instructions are evaluated by the circuit 25 to see if they
can be issued at the same time. The third stage S2 is the decode
stage, during which the two instructions are decoded in the decoders
23 and 24 to produce the control signals 28 and 29 and register
addresses 26 and 27. The fourth stage S3 is the register file 43
access stage for operate instructions, and also is the issue check
decision point for all instructions, and the instruction issue stage.
The fifth stage S4 is cycle one of the computation (in ALU 40, for
example) if it is an operate instruction, and also the instruction
unit 18 computes the new PC 33 in address generator 32; if it is a
memory reference instruction the address unit 19 calculates the
effective data stream address using the adder 53. The sixth stage S5
is cycle two of the computation (e.g., in ALU 40) if it is an operate
instruction, and also the data TB 48 lookup
stage for memory references. The last stage S6 is the 
write stage for operate instructions having a register 
That is, the CPU 10 can allow dual issue of an integer write, during Which, for example, the output 40e of the 
load or store instruction with an integer operate instruc- ALU 40 is written to the register file 43 via the write 
tion, but not an integer branch with an integer load or 60 port, and is the data cache 59 or instruction cache 21 
store. Of course, the circuitry 25 also checks to see if the hit/miss decision point for instruction stream or data 
resources are available before allowing two instructions stream references. 

to issue in the same cycle. The CPU 10 pipeline divides these seven stages 

An important feature is the RISC characteiistic of the S0-S6 of instruction proces^g into four static and 
CPU 10 of FIOS. 1-6. The instructions executed by this 6S three dynamic steges of execution. The first four stages 
CPU 10 are always of the same size, in this case 32-bits, S0-S3 consist of the instruction fetch, swap, decode and 
instead of allowing variable-length instructions. The issue logic as just described. These stages S<K-S3 are 
instructions execute on average in one machine cycle static in that instructions may remain valid in the same 
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Column B 


Integer Operate 


Flouing Operate 


Floating Lo*d/Ston 


Inleger Load/Slore 


FkKtmg Bnnch 


Inlcger Branch 


JSR 




pipeline stage for multiple cycles while waiting for a resource or stalling for other reasons. These stalls are also referred to as pipeline freezes. A pipeline freeze may occur while zero instructions issue, or while one instruction of a pair issues and the second is held at the issue stage. A pipeline freeze implies that a valid instruction or instructions is (are) presented to be issued but can not proceed.

Upon satisfying all issue requirements, instructions are allowed to continue through the pipeline toward completion. After issuing in S3, instructions can not be held in a given pipe stage S4-S6. It is up to the issue stage S3 (circuitry 25) to insure that all resource conflicts are resolved before an instruction is allowed to continue. The only means of stopping instructions after the issue stage S3 is an abort condition.

Aborts may result from a number of causes. In general, they may be grouped into two classes, namely exceptions (including interrupts) and non-exceptions. The basic difference between the two is that exceptions require that the pipeline be flushed of all instructions which were fetched subsequent to the instruction which caused the abort condition, including dual issued instructions, and restart the instruction fetch at the redirected address. Examples of non-exception abort conditions are branch mispredictions, subroutine call and return mispredictions and instruction cache 21 misses. Data cache 59 misses do not produce abort conditions but can cause pipeline freezes.

In the event of an exception, the CPU 10 first aborts all instructions issued after the excepting instruction. Due to the nature of some error conditions, this may occur as late as the write cycle. Next, the address of the excepting instruction is latched in an internal processor register. When the pipeline is fully drained the processor begins instruction execution at the address given by a dispatch using a PAL code. The pipeline is drained when all outstanding writes to both the integer and floating point register file 43 and 61 have completed and all outstanding instructions have passed the point in the pipeline such that all instructions are guaranteed to complete without an exception in the absence of a machine check.

Referring to FIG. 8, the formats of the various types of instructions of the instruction set executed by the CPU 10 of FIGS. 1-7 are illustrated. One type is a memory instruction 70, which contains a 6-bit opcode in bits <31:26>, two 5-bit register address fields Ra and Rb in bits <25:21> and <20:16>, and a 16-bit signed displacement in bits <15:0>. This instruction is used to transfer data between registers 43 and memory (memory 12 or caches 59 or 20), to load an effective address to a register of the register file, and for subroutine jumps. The displacement field <15:0> is a byte offset; it is sign-extended and added to the contents of register Rb to form a virtual address. The virtual address is used as a memory load/store address or a result value depending upon the specific instruction.

The branch instruction format 71 is also shown in FIG. 8, and includes a 6-bit opcode in bits <31:26>, a 5-bit address field in bits <25:21>, and a 21-bit signed branch displacement in bits <20:0>. The displacement is treated as a longword offset, meaning that it is shifted left two bits (to address a longword boundary), sign-extended to 64-bits and added to the updated contents of PC 33 to form the target virtual address (overflow is ignored).

The operate instructions 72 and 73 are of the formats shown in FIG. 8, one format 72 for three register operands and one format 73 for two register operands and a literal. The operate format is used for instructions that perform integer register operations, allowing two source operands and one destination operand in register file 43. One of the source operands can be a literal constant. Bit-12 defines whether the operate instruction is for a two source register operation or one source register and a literal. In addition to the 6-bit opcode at bits <31:26>, the operate format has a 7-bit function field at bits <11:5> to allow a wider range of choices for arithmetic and logical operation. The source register Ra is specified in either case at bits <25:21>, and the destination register Rc at <4:0>. If bit-12 is a zero, the source register Rb is defined at bits <20:16>, while if bit-12 is a one then an 8-bit zero-extended literal constant is formed by bits <20:13> of the instruction. This literal is interpreted as a positive integer in the range 0-255, and is zero-extended to 64-bits.

The floating point operate instruction format 74 of FIG. 8 is used for instructions that perform floating point register-to-register operations. The floating point operate instructions contain a 6-bit opcode at bits <31:26> as before, along with an 11-bit function field at bits <15:5>. There are three operand fields, Fa, Fb and Fc, each specifying either an integer or a floating-point operand as defined by the instruction; only the registers 61 are specified by Fa, Fb and Fc, but these registers can contain either integer or floating-point values. Literals are not supported. Floating point conversions use a subset of the floating point operate format 74 of FIG. 8 and perform register-to-register conversion operations; the Fb operand specifies the source and the Fa operand should be reg-31 (all zeros).

The other instruction format 75 of FIG. 8 is that for privileged architecture library (PAL or PALcode) instructions, which are used to specify extended processor functions. In these instructions a 6-bit opcode is present at bits <31:26> as before, and a 26-bit PALcode function field <25:0> specifies the operation. The source and destination operands for PALcode instructions are supplied in fixed registers that are specified in the individual instruction definitions.

The six-bit opcode field <31:26> in the instruction formats of FIG. 8 allows only 2^6 or sixty-four different instructions to be coded. Thus the instruction set would be limited to sixty-four. However, the "function" fields in the instruction formats 72, 73 and 74 allow variations of instructions having the same opcode in bits <31:26>. Also, the "hint" bits in the jump instruction allow variations such as JSR, RET, as explained below.

Referring to FIG. 9, the format 76 of the virtual address asserted on the internal address bus 56 is shown. This address is nominally 64-bits in width, but of course practical implementations within the next few years will use much smaller addresses. For example, an address of 43-bits provides an addressing range of 8-Terabytes. The format includes a byte offset 77 of, for example, 13-bits to 16-bits in size, depending upon the page size employed. If pages are 8-Kbytes, the byte-within-page field 77 is 13-bits, for 16-Kbyte pages the field 77 is 14-bits, for 32-Kbyte pages it is 15-bits, and for 64-Kbyte pages it is 16-bits. The format 76 as shown includes three segment fields 78, 79 and 80, labelled Seg1, Seg2 and Seg3, also of variable size depending upon the implementation. The segments Seg1, Seg2, and Seg3
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can be IO-to-13 bits, for example. If each segment size is 
10-bits, then a segment defined by Seg3 is 1K pages, a segment for Seg2 is 1M pages, and a segment for Seg1 is 1G pages. Segment number fields Seg1, Seg2 and Seg3 are of the same size for a given implementation. The segment number fields are a function of the page size; all page table entries at a given level do not exceed one page, so page swapping to access the page table is minimized. The page frame number (PFN) field in the PTE is always 32-bits wide; thus, as the page size grows the virtual and physical address size also grows.
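
The field widths described for the virtual address format can be illustrated with a short sketch. It is only a model of the arithmetic in the text, assuming 10-bit segment fields and power-of-two page sizes; the function names are hypothetical and do not appear in the patent.

```python
def offset_bits(page_size_bytes):
    """Width in bits of the byte-within-page field for a power-of-two page size."""
    if page_size_bytes <= 0 or page_size_bytes & (page_size_bytes - 1):
        raise ValueError("page size must be a power of two")
    return page_size_bytes.bit_length() - 1


def split_virtual_address(va, page_size_bytes=8192, seg_bits=10):
    """Split a virtual address into (Seg1, Seg2, Seg3, byte offset)."""
    off = offset_bits(page_size_bytes)
    seg_mask = (1 << seg_bits) - 1
    byte_offset = va & ((1 << off) - 1)
    seg3 = (va >> off) & seg_mask                    # lowest segment field
    seg2 = (va >> (off + seg_bits)) & seg_mask
    seg1 = (va >> (off + 2 * seg_bits)) & seg_mask   # highest segment field
    return seg1, seg2, seg3, byte_offset


def implemented_address_bits(page_size_bytes=8192, seg_bits=10):
    """Three segment fields plus the byte-within-page field."""
    return 3 * seg_bits + offset_bits(page_size_bytes)
```

With 8-Kbyte pages and 10-bit segments this gives 13 + 30 = 43 address bits, which matches the 8-Terabyte example, and a Seg3, Seg2 or Seg1 segment spans 2^10, 2^20 or 2^30 pages respectively.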

The physical addresses are at most 48-bits, but a processor may implement a smaller physical address space by not implementing some number of high-order bits. The two most significant implemented physical address bits select a caching policy or implementation-dependent type of address space. Different implementations may put different uses and restrictions on these bits as appropriate for the system. For example, in a workstation with a 30-bit <29:0> physical address space, bit <29> may select between memory and I/O and bit <28> may enable or disable caching in I/O space and must be zero in memory space.

Typically, in a multiprogramming system, several processes may reside in physical memory 12 (or caches) at the same time, so memory protection and multiple address spaces are used by the CPU 10 to ensure that one process will not interfere with either other processes or the operating system. To further improve software reliability, four hierarchical access modes provide memory access control. They are, from most to least privileged: kernel, executive, supervisor, and user. Protection is specified at the individual page level, where a page may be inaccessible, read-only, or read/write for each of the four access modes. Accessible pages can be restricted to have only data or instruction access.

A page table entry or PTE 81, as stored in the translation buffers 36 or 48 or in the page tables set up in the memory 12 by the operating system, is illustrated in FIG. 10. The PTE 81 is a quadword in width, and includes a 32-bit page frame number or PFN 82 at bits <63:32>, as well as certain software and hardware control information in a field 83 having bits <15:0> as set forth in Table A to implement the protection features and the like.

A particular feature is the granularity hint 84 in the two bits <6:5>. Software may set these bits to a non-zero value to supply a hint to the translation buffer 36 or 48 that blocks of pages may be treated as a larger single page. The block is an aligned group of 8^N pages, where N is the value of PTE<6:5>, e.g., a group of 1-, 8-, 64-, or 512-pages starting at a virtual address with (pagesize + 3N) low-order zeros. The block is a group of physically contiguous pages that are aligned both virtually and physically; within the block, the low 3N bits of the PFNs describe the identity mapping (i.e., are used as part of the physical address by adding to the byte-within-page field) and the high (32-3N) PFN bits are all equal. Within the block, all PTEs have the same values for bits <15:0>, i.e., the same protection, fault, granularity, and valid bits of Table A. Hardware may use this hint to map the entire block with a single TB entry, instead of eight, sixty-four or 512 separate TB entries. Note that a granularity hint might be appropriate for a large memory structure such as a frame buffer or non-paged pool that in fact is mapped into contiguous virtual pages with identical protection, fault, and valid bits.



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An example of the use of the granularity hint is the 
storage of a video frame for a display; here the block of 
data defining one frame may occupy sixty-four 8 KB pages for a high-resolution color display, and so to avoid using sixty-four page table entries to map the physical addresses for this frame, one can be used instead. This avoids a large amount of swapping of PTEs from physical memory 12 to TB 48 in the case of a reference to the frame buffer to draw a vertical line in the screen, for example.
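
The block arithmetic behind the granularity hint can be sketched as follows. This is an illustrative model only, assuming 8-Kbyte pages (a 13-bit byte-within-page field) and a block whose first page frame number is aligned to the block size; the function names are hypothetical.

```python
def block_pages(hint):
    """Number of pages in a block for granularity hint value N (0..3): 8**N."""
    if not 0 <= hint <= 3:
        raise ValueError("the hint is a two-bit field")
    return 8 ** hint


def block_physical_address(va, block_base_pfn, hint, page_bits=13):
    """Translate a virtual address inside a hinted block with one mapping.

    The low (page_bits + 3N) bits of the virtual address pass through
    unchanged (the identity mapping); the rest comes from the block's
    first PFN, which must be aligned to the block size.
    """
    if block_base_pfn % block_pages(hint):
        raise ValueError("block must be physically aligned")
    low_bits = page_bits + 3 * hint
    return (block_base_pfn << page_bits) | (va & ((1 << low_bits) - 1))
```

For the frame buffer example, a hint value of 2 gives a 64-page block, so one translation covers 64 x 8 KB = 512 KB.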

Referring to FIG. 11, the virtual address on the bus 56 is used to search for a PTE in the TB 48, and, if not found, then Seg1 field 78 is used to index into a first page table 85 found at a base address stored in an internal register 86. The entry 87 found at the Seg1 index in table 85 is the base address for a second page table 88, for which the Seg2 field 79 is used to index to an entry 89. The entry 89 points to the base of a third page table 90, and Seg3 field 80 is used to index to a PTE 91, which is the physical page frame number combined with the byte offset 77 from the virtual address, in adder 92, to produce the physical address on bus 54. As mentioned above, the size of the byte offset 77 can vary depending upon the granularity hint 84.
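
The three-level lookup of FIG. 11 can be modelled in a few lines. The sketch below assumes 8-Kbyte pages and 10-bit segment fields, represents each page table as a Python dictionary, and uses a plain dictionary in place of the TB 48; these representations and the function name are illustrative assumptions, not the hardware structure.

```python
PAGE_BITS = 13   # assumed 8-Kbyte pages
SEG_BITS = 10    # assumed segment field width


def translate(va, root_table, tb=None):
    """Return the physical address for va, walking three table levels on a TB miss."""
    tb = {} if tb is None else tb
    vpn = va >> PAGE_BITS
    offset = va & ((1 << PAGE_BITS) - 1)
    if vpn in tb:                        # TB hit: no table walk needed
        pfn = tb[vpn]
    else:
        mask = (1 << SEG_BITS) - 1
        seg1 = (vpn >> (2 * SEG_BITS)) & mask
        seg2 = (vpn >> SEG_BITS) & mask
        seg3 = vpn & mask
        second = root_table[seg1]        # first-level entry points to second table
        third = second[seg2]             # second-level entry points to third table
        pfn = third[seg3]                # third-level entry holds the PFN
        tb[vpn] = pfn                    # remember the translation
    return (pfn << PAGE_BITS) + offset   # combine PFN with the byte offset
```

A missing entry raises `KeyError` here, standing in for a translation fault; protection bits and the granularity hint are left out to keep the walk itself visible.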

Using the instruction formats of FIG. 8, the CPU of FIG. 1 executes an instruction set which includes nine types of instructions. These include (1) integer load and store instructions, (2) integer control instructions, (3) integer arithmetic, (4) logical and shift instructions, (5) byte manipulation, (6) floating point load and store, (7) floating point control, (8) floating point arithmetic, and (9) miscellaneous.
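
The bit positions given for the memory, branch and operate formats of FIG. 8 can be checked with a small field extractor. This is a sketch for illustration: the helper names are hypothetical, and the opcode values used with it are arbitrary placeholders rather than real opcode assignments.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit field <hi:lo> from a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret a width-bit field as a two's complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_memory(word):
    return {"opcode": bits(word, 31, 26), "ra": bits(word, 25, 21),
            "rb": bits(word, 20, 16), "disp": sign_extend(bits(word, 15, 0), 16)}


def decode_branch(word):
    return {"opcode": bits(word, 31, 26), "ra": bits(word, 25, 21),
            "disp": sign_extend(bits(word, 20, 0), 21)}


def decode_operate(word):
    fields = {"opcode": bits(word, 31, 26), "ra": bits(word, 25, 21),
              "function": bits(word, 11, 5), "rc": bits(word, 4, 0)}
    if bits(word, 12, 12):                       # bit 12 set: literal form
        fields["literal"] = bits(word, 20, 13)   # 8-bit zero-extended, 0-255
    else:                                        # bit 12 clear: register form
        fields["rb"] = bits(word, 20, 16)
    return fields
```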

The integer load and store instructions use the memory format 70 of FIG. 8 and include the following:

LDA      Load Address
LDAH     Load Address High (shift high)
LDL      Load Sign Extended Longword
LDQ      Load Quadword
LDL_L    Load Sign Extended Longword Locked
LDQ_L    Load Quadword Locked
LDQ_U    Load Quadword Unaligned
STL      Store Longword
STQ      Store Quadword
STL_C    Store Longword Conditional
STQ_C    Store Quadword Conditional
STQ_U    Store Quadword Unaligned


For each of these the virtual address is computed by adding register Rb to the sign-extended 16-bit displacement (or 65536 times the sign-extended displacement for LDAH).

For load instructions LDL and LDQ the source operand is fetched from memory at the computed address, sign extended if a longword, and written to register Ra. If the data is not naturally aligned an alignment exception is generated. For the store instructions STL and STQ the content of register Ra is written to memory at the computed virtual address. The load address instructions LDA and LDAH are like the load instructions LDL and LDQ, but the operation stops after the address is computed; the 64-bit computed virtual address is written to register Ra.
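
The address computation just described can be written out as a short sketch, with 64-bit wraparound modelled by masking. The function names are hypothetical, and the alignment helper only illustrates what "naturally aligned" means for a longword (4 bytes) or quadword (8 bytes).

```python
MASK64 = (1 << 64) - 1


def sext16(disp):
    """Sign-extend a 16-bit displacement field."""
    disp &= 0xFFFF
    return disp - 0x10000 if disp & 0x8000 else disp


def effective_address(rb, disp, high=False):
    """Rb plus the sign-extended displacement; LDAH scales it by 65536."""
    d = sext16(disp)
    if high:
        d *= 65536
    return (rb + d) & MASK64


def is_naturally_aligned(address, size_bytes):
    """True if an access of size_bytes at address would not raise an alignment exception."""
    return address % size_bytes == 0
```

As a usage example, an LDAH followed by an LDA can form a 32-bit constant: for 0x12348765 the low half 0x8765 sign-extends to a negative value, so the high half must be 0x1235, and `effective_address(effective_address(0, 0x1235, high=True), 0x8765)` yields 0x12348765.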

The Load Locked and Store Conditional instructions (LDL_L, LDQ_L, STL_C and STQ_C) together provide an important feature of the architecture herein described. Particularly, this combination of instructions serves to insure data integrity in a multiple processor or
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IS 



pipelined processor system by providing an atomic update of a shared memory location. As in the other instructions of this type, the virtual address is computed by adding the contents of the register Rb specified in the instruction to the sign-extended 16-bit displacement given in the instruction. When a LDL_L or LDQ_L instruction is executed without faulting, the CPU 10 records the target physical address from bus 54 to a locked physical address register 95 of FIG. 5, and sets a lock flag 96. If the lock flag 96 is still set when a store conditional instruction is executed, the store occurs, i.e., the operand is written to memory at the physical address, and the value of the lock flag 96 (a one) is returned in Ra and the lock flag set to zero; otherwise, if the lock flag is zero, the store to memory does not occur, and the value returned to Ra is zero.

If the lock flag for the CPU 10 is set, and another CPU 15 does a store within the locked range of physical addresses in memory 12, the lock flag 96 in CPU 10 is cleared. To this end, the CPU 10 monitors all writes to memory 12 and if the address in register 95 is matched, the flag 96 is cleared. The locked range is the aligned block of 2^N bytes that includes the locked physical address in register 95; this value 2^N may vary depending upon the construction of a CPU, and is at least eight bytes (minimum lock range is an aligned quadword); the value is at most the page size for this CPU (maximum lock range is one physical page). The lock flag 96 of a CPU 10 is also cleared if the CPU encounters any exception, interrupt, or a call PALcode instruction.
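
The lock flag behaviour described above can be simulated in software to make the rules concrete. The sketch below is a toy model under stated assumptions: the class and method names are hypothetical, the lock range is fixed at the 8-byte minimum, memory is a dictionary, and exceptions and interrupts are not modelled.

```python
class SharedMemory:
    """Memory that reports every write to the other CPUs (the monitoring of writes)."""

    def __init__(self):
        self.data = {}
        self.cpus = []

    def write(self, address, value, writer):
        self.data[address] = value
        for cpu in self.cpus:
            if cpu is not writer:
                cpu.observe_write(address)


class Cpu:
    LOCK_RANGE = 8  # assumed: minimum lock range, an aligned quadword

    def __init__(self, memory):
        self.memory = memory
        self.lock_flag = 0
        self.locked_address = None
        memory.cpus.append(self)

    def load_locked(self, address):
        self.locked_address = address      # models register 95
        self.lock_flag = 1                 # models lock flag 96
        return self.memory.data.get(address, 0)

    def store_conditional(self, address, value):
        succeeded = self.lock_flag
        if succeeded:
            self.memory.write(address, value, self)
        self.lock_flag = 0                 # flag is zero after either outcome
        return succeeded                   # the value returned in Ra

    def observe_write(self, address):
        same_block = (address // self.LOCK_RANGE
                      == (self.locked_address or 0) // self.LOCK_RANGE)
        if self.lock_flag and same_block:
            self.lock_flag = 0


def atomic_add(cpu, address, delta):
    """Retry the locked read-modify-write until the conditional store succeeds."""
    while True:
        new_value = cpu.load_locked(address) + delta
        if cpu.store_conditional(address, new_value):
            return new_value
```

A store by a second CPU anywhere in the same aligned 8-byte block clears the first CPU's flag, so its pending conditional store returns zero and leaves memory unchanged.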

The instruction sequence

    LDQ_L
    modify
    STQ_C
    BEQ

executed on the CPU 10 does an atomic read-modify-write of a datum in shared memory 12 if the branch falls through; if the branch is taken, the store did not modify the location in memory 12 and so the sequence may be repeated until it succeeds. That is, the branch will be taken if register Ra is equal to zero, meaning the value of the lock flag returned to Ra by the store conditional instruction is zero (the store did not succeed). This instruction sequence is shown in more detail in Appendix A.

If two load locked instructions are executed with no intervening store conditional, the second one overwrites the state of the first in lock flag 96 and register 95. If two store conditional instructions execute with no intervening load locked instruction, the second store always fails because the first clears the lock flag 96.

The load unaligned instructions LDQ_U and LDL_U are the same as a load LDQ or LDL, but the low-order 3-bits of the virtual address are cleared (the load unaligned instructions are used for byte addresses), so an aligned quadword or longword is fetched. Also, no alignment fault is signalled, as it would be for a simple LDQ or LDL instruction if a byte address (unaligned address) were seen. A load unaligned instruction is used for byte manipulation as will be described below. The store unaligned instruction STQ_U is likewise similar to the STQ instruction, but it removes the low-order three bits of the virtual address, and does not signal a fault due to the unaligned address.

The control type of instructions include eight conditional branch instructions, an unconditional branch, branch to subroutine, and a jump to subroutine instruction, all using the branch instruction format 71 or memory instruction format 70 of FIG. 8. These control instructions are:

Using branch instruction format:

BEQ      Branch if Register Equal to Zero
BNE      Branch if Register Not Equal to Zero
BLT      Branch if Register Less Than Zero
BLE      Branch if Register Less Than or Equal to Zero
BGT      Branch if Register Greater Than Zero
BGE      Branch if Register Greater Than or Equal to Zero
BLBC     Branch if Register Low Order Bit is Clear
BLBS     Branch if Register Low Order Bit is Set
BR       Unconditional Branch
BSR      Branch to Subroutine

Using memory instruction format:

JMP              Jump
JSR              Jump to Subroutine
RET              Return from Subroutine
JSR_COROUTINE    Jump to Subroutine Return

For the conditional branch instructions, the register Ra is tested, and if the specified relationship is true, the PC is loaded with the target virtual address; otherwise, execution continues with the next sequential instruction. The displacement for either conditional or unconditional branches is treated as a signed longword offset, meaning it is shifted left two bits (to address a longword boundary), sign-extended to 64-bits, and added to the updated PC to form the target virtual address. The conditional or unconditional branch instructions are PC-relative only, the 21-bit signed displacement giving a forward/backward branch distance of +/- 1M longwords.

For the unconditional branch instructions BR or BSR, the address of the instruction following the BR or BSR (i.e., the updated PC) is written to register Ra, followed by loading the PC with the target virtual address. BR and BSR do identical operations; they only differ in hints to branch-prediction logic: BSR is predicted as a subroutine call (pushes the return address on a branch-prediction stack), while BR is predicted as a branch (no push).

For the jump and return instructions, the address of the instruction following this instruction (the updated PC) is written to register Ra, followed by loading the PC with the target virtual address. The new PC is supplied from register Rb, with the two low-order bits of Rb being ignored. Ra and Rb may specify the same register; the target calculation using the old value is done before the assignment of the new value.

All four instructions JMP, JSR, RET and JSR_COROUTINE do identical operations; they only differ in hints to branch-prediction logic. The displacement field of the instruction (not being used for a displacement) is used to pass this information. The four different "opcodes" set different bit patterns in disp<15:14>, and the hint operand sets disp<13:0>. These bits are intended to be used as follows:

disp<15:14>   meaning   Predicted Target<15:0>   Prediction Stack Action
00            JMP       PC+{4*disp<13:0>}
01            JSR       PC+{4*disp<13:0>}        push PC
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-continued 



disp<15:14>   meaning          Predicted Target<15:0>   Prediction Stack Action
10            RET              Prediction stack         pop
11            JSR_COROUTINE    Prediction stack         pop, push PC

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This construction allows specification of the low 16-bits 
of a likely longword target address (enough bits to start a useful instruction cache 21 access early), and also allows distinguishing call from return (and from the other less frequent operations). Note that the information according to this table can only be used as a hint; correct setting of these bits can improve performance but is not needed for correct operation.
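
The branch target arithmetic and the layout of the jump hint bits described above can be sketched together. The function names are hypothetical; the mapping from disp<15:14> to the four jump forms follows the table, and only the low 16 bits of the predicted target are formed, as the text states.

```python
MASK64 = (1 << 64) - 1
JUMP_KINDS = {0: "JMP", 1: "JSR", 2: "RET", 3: "JSR_COROUTINE"}


def sign_extend(value, width):
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def branch_target(updated_pc, disp21):
    """PC-relative target: the 21-bit displacement is a signed longword offset."""
    offset = 4 * sign_extend(disp21 & 0x1FFFFF, 21)
    return (updated_pc + offset) & MASK64      # overflow is ignored


def decode_jump_hint(disp16):
    """Split the unused displacement field into (jump kind, 14-bit hint)."""
    return JUMP_KINDS[(disp16 >> 14) & 3], disp16 & 0x3FFF


def predicted_target_low16(updated_pc, hint14):
    """Low 16 bits of the likely target for the JMP and JSR forms."""
    return (updated_pc + 4 * hint14) & 0xFFFF
```

Because only 16 bits of the prediction are kept and 4 x disp<13:0> already occupies 16 bits, it makes no difference to the result whether the hint is treated as signed or unsigned. The real target always comes from register Rb, so a wrong hint costs time but not correctness.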

Thus, to allow the CPU to achieve high performance, explicit hints based on a branch-prediction model are provided as follows:

(1) For many implementations of computed branches (JSR, RET, JMP), there is a substantial performance gain in forming a good guess of the expected target instruction cache 21 address before register Rb is accessed.

(2) The CPU may be constructed with the first (or only) instruction cache 21 being small, no bigger than a page (8-64 KB).

(3) Correctly predicting subroutine returns is important for good performance, so optionally the CPU may include a small stack of predicted subroutine return instruction cache 21 addresses.

To this end, the CPU 10 provides three kinds of branch-prediction hints: likely target address, return-address stack action, and conditional branch taken.

For computed branches (JSR/RET/JMP), otherwise unused displacement bits are used to specify the low 16-bits of the most likely target address. The PC-relative calculation using these bits can be exactly the PC-relative calculation used in conditional branches. The low 16-bits are enough to specify an instruction cache 21 block within the largest possible page and hence are expected to be enough for the branch-prediction logic to start an early instruction cache 21 access for the most likely target.

For all branches, hint or opcode bits are used to distinguish simple branches, subroutine calls, subroutine returns, and coroutine links. These distinctions allow the branch-prediction logic to maintain an accurate stack of predicted return addresses.

For conditional branches, the sign of the target displacement is used by the branch-prediction logic as a taken/fall-through hint. Forward conditional branches (positive displacement) are predicted to fall through. Backward conditional branches (negative displacement) are predicted to be taken. Conditional branches do not affect the predicted return address stack.
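
The two prediction rules above, the sign-of-displacement hint and the return-address stack actions, can be sketched as follows. The names and the stack depth of four entries are assumptions made for illustration; the text only says the stack is small.

```python
def predict_taken(signed_displacement):
    """Backward branches are predicted taken, forward branches fall through."""
    return signed_displacement < 0


class ReturnStack:
    """Small stack of predicted subroutine return addresses."""

    def __init__(self, depth=4):          # depth is an assumed value
        self.depth = depth
        self.entries = []

    def apply(self, kind, updated_pc):
        """Apply the stack action for one branch and return any predicted target."""
        predicted = None
        if kind in ("RET", "JSR_COROUTINE") and self.entries:
            predicted = self.entries.pop()             # pop
        if kind in ("JSR", "BSR", "JSR_COROUTINE"):
            self.entries.append(updated_pc)            # push PC
            del self.entries[:-self.depth]             # keep only the newest entries
        return predicted
```

Simple branches (BR, JMP and the conditional branches) fall through both tests and leave the stack unchanged.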

The integer arithmetic instructions perform add, subtract, multiply, and signed and unsigned compare operations on integers of registers 43, returning the result to an integer register 43. These instructions use either of the integer operate formats of FIG. 8 (three-register, or two-register and literal) and include the following:

ADDL     Add Longword
ADDQ     Add Quadword
CMPEQ    Compare Signed Quadword Equal
CMPLT    Compare Signed Quadword Less Than
CMPLE    Compare Signed Quadword Less Than or Equal
CMPULT   Compare Unsigned Quadword Less Than
CMPULE   Compare Unsigned Quadword Less Than or Equal
MULL     Multiply Longword
MULQ     Multiply Quadword
UMULH    Unsigned Quadword Multiply High
SUBL     Subtract Longword
SUBQ     Subtract Quadword

For the ADDL instructions, register Ra is added to register Rb or to a literal, and the sign-extended 32-bit sum is written to register Rc; the high-order 32-bits of Ra and Rb are ignored. For ADDQ instructions, register Ra is added to register Rb or to a literal, and the 64-bit sum is written to Rc. The unsigned compare instructions can be used to test for a carry; after adding two values using ADD, if the unsigned sum is less than either one of the inputs, there was a carry out of the most significant bit.

For the compare instructions, register Ra is compared to register Rb or a literal, and if the specified relationship is true the value one is written to the register Rc; otherwise, zero is written to register Rc.

The multiply instructions cause the register Ra to be multiplied by the contents of the register Rb or a literal and the product is written to register Rc. For MULL, the product is a 32-bit sign-extended value, while MULQ results in a 64-bit product. For the unsigned quadword multiply high instruction UMULH, register Ra and Rb or a literal are multiplied as unsigned numbers to produce a 128-bit result; the high-order 64-bits are written to register Rc.

For the subtract instructions, the register Rb or a literal is subtracted from the register Ra and the difference is written to the destination register Rc. The difference is a sign-extended 32-bit value for SUBL, or a 64-bit value for SUBQ. The unsigned compare instructions can be used to test for a borrow; if the unsigned minuend (Ra) is less than the unsigned subtrahend (Rb), there will be a borrow.

The logical instructions are of the operate format and perform quadword Boolean operations. These instructions are as follows:

AND      Logical Product
BIS      Logical Sum
XOR      Logical Difference
BIC      Logical Product with Complement
ORNOT    Logical Sum with Complement
EQV      Logical Equivalence

These instructions perform the designated Boolean function between register Ra and register Rb or a literal, and write the result to the destination register Rc. The "NOT" function can be performed by doing an ORNOT with zero (Ra=R31).

The shift instructions are of the operate format and perform left and right logical shift and right arithmetic shift in the shifter 41, as follows:

SLL      Shift Left Logical
SRL      Shift Right Logical
SRA      Shift Right Arithmetic




There is no arithmetic left shift instruction because, 
typically, where an arithmetic left shift would be used, 
a logical shift will do. For multiplying a small power of 
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two in address computations, logical left shift is act^t- 
able. Arithmetic left shift is more complicated because it 
requires overflow detection. Integer multiply should be used to perform arithmetic left shift with overflow checking. Bit field extracts can be done with two logical shifts; sign extension can be done with left logical shift and a right arithmetic shift. For the logical shifts, the register Ra is shifted logically left or right 0-to-63 bits by the count in register Rb or a literal, and the result is written to the register Rc with zero bits propagated into the vacated bit positions. Likewise, for the shift right arithmetic instruction, the register Rb is right shifted arithmetically 0-to-63 bits by the count in the register Ra or a literal, and the result written to the register Rc, with the sign bit (Rbv<63>) propagated into the vacated bit positions.
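
The two shift idioms mentioned above, field extraction with two logical shifts and sign extension with a left logical shift followed by a right arithmetic shift, can be modelled on 64-bit values. This is a sketch with hypothetical function names; Python integers are unbounded, so the 64-bit register width is imposed by masking.

```python
MASK64 = (1 << 64) - 1


def sll(value, count):
    """Shift left logical within 64 bits; zeros fill the vacated positions."""
    return (value << (count & 63)) & MASK64


def srl(value, count):
    """Shift right logical within 64 bits; zeros fill the vacated positions."""
    return (value & MASK64) >> (count & 63)


def sra(value, count):
    """Shift right arithmetic within 64 bits; the sign bit fills the vacated positions."""
    value &= MASK64
    signed = value - (1 << 64) if value >> 63 else value
    return (signed >> (count & 63)) & MASK64


def extract_field(value, hi, lo):
    """Bit field <hi:lo> using two logical shifts."""
    return srl(sll(value, 63 - hi), 63 - hi + lo)


def sign_extend_low(value, width):
    """Sign-extend the low width bits using a left logical and a right arithmetic shift."""
    return sra(sll(value, 64 - width), 64 - width)
```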

An important feature which allows improved performance is the conditional move integer CMOV instruction. These instructions perform conditionals without a branch, and so maintain the sequentiality of the instruction stream. These instructions are of the operate format, and include:

CMOVEQ    Conditional Move if Register Equal to Zero
CMOVNE    Conditional Move if Register Not Equal to Zero
CMOVLT    Conditional Move if Register Less Than Zero
CMOVLE    Conditional Move if Register Less Than or Equal to Zero
CMOVGT    Conditional Move if Register Greater Than Zero
CMOVGE    Conditional Move if Register Greater Than or Equal to Zero
CMOVLBC   Conditional Move if Register Low Bit Clear
CMOVLBS   Conditional Move if Register Low Bit Set

In executing these conditional move instructions, the register Ra is tested, and if the specified relationship is true, the value in register Rb is written to the register Rc. The advantage of having this alternative is in execution speed. For example, an instruction CMOVEQ Ra,Rb,Rc is exactly equivalent to

        BNE     Ra,label
        OR      R31,Rb,Rc
label:

except that the CMOV way is likely in many implementations to be substantially faster. A branchless sequence for finding the greater of the contents of two registers, R1=MAX(R1,R2) is:

        CMPLT   R1,R2,R3    ! R3=1 if R1<R2
        CMOVNE  R3,R2,R1    ! Do nothing if NOT(R1<R2)
                            ! Move R2 to R1 if R1<R2

Of course, the advantage of not using branches is that the instruction stream is fetched sequentially, and there is no need to flush the instruction cache or prefetch queue. A conditional move is faster than a branch even if the branch is predicted correctly. If the branch is not predicted correctly, the conditional move is much faster because it eliminates a branch operation.

Another important feature is providing instructions for operating on byte operands within registers. These allow full-width 64-bit memory accesses in the load/store instructions, yet combined with a variety of in-register byte manipulations a wide variety of byte operations are possible. The advantage is that of being able to use code written for architectures which allow byte operations in memory, but yet constrain the memory accesses to full quadword aligned boundaries. The byte manipulation instructions are of the operate format 72 or 73 of FIG. 8 and include compare byte, extract byte, mask byte, and zero byte instructions as follows:





CMPBGE   Compare byte
EXTBL    Extract byte low
EXTWL    Extract word low
EXTLL    Extract longword low
EXTQL    Extract quadword low
EXTWH    Extract word high
EXTLH    Extract longword high
EXTQH    Extract quadword high
INSBL    Insert byte low
INSWL    Insert word low
INSLL    Insert longword low
INSQL    Insert quadword low
INSWH    Insert word high
INSLH    Insert longword high
INSQH    Insert quadword high
MSKBL    Mask byte low
MSKWL    Mask word low
MSKLL    Mask longword low
MSKQL    Mask quadword low
MSKWH    Mask word high
MSKLH    Mask longword high
MSKQH    Mask quadword high
ZAP      Zero bytes
ZAPNOT   Zero bytes not



The compare byte instruction does eight parallel
unsigned byte comparisons between corresponding
bytes of the registers Ra and Rb (or Ra and a literal),
storing the eight results in the low eight bits of the
register Rc; the high 56 bits of the register Rc are set to
zero. Bit-0 of Rc corresponds to byte-0, bit-1 of Rc to
byte-1, etc. A result bit is set in Rc if the corresponding
byte of Ra is greater than or equal to Rb (unsigned).
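As a rough illustration of the compare byte operation just described, the following Python sketch models the eight parallel unsigned comparisons on 64-bit integers. The function name `cmpbge` is borrowed from the mnemonic used in the Appendix; modelling registers as Python integers is an assumption made only for this sketch.

```python
def cmpbge(ra, rb):
    """Return an 8-bit result: bit i is set if byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        byte_a = (ra >> (8 * i)) & 0xFF
        byte_b = (rb >> (8 * i)) & 0xFF
        if byte_a >= byte_b:
            result |= 1 << i  # bit-0 corresponds to byte-0, bit-1 to byte-1, etc.
    return result  # the high 56 bits are zero by construction
```

Comparing a zero operand against a quadword, as the Appendix does, sets a result bit only where the quadword holds a zero byte.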

The extract byte instructions shift register Ra by 0-7
bytes (shift right for low, shift left for high), then extract
one, two, four or eight bytes into the register Rc,
with the number of bytes to shift being specified by bits
<2:0> of the register Rb, and the number of bytes to
extract being specified in the function code; remaining
bytes are filled with zeros. The extract byte high instructions
shift left by a number of bytes which is eight
minus the amount specified by bits <2:0> of register
Rb. These extract byte instructions are particularly
useful in byte manipulation where a non-aligned multi-byte
datum in memory is to be operated upon, as set
forth in the examples for byte extract in the Appendix.
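The way the low and high extracts combine to fetch a non-aligned datum can be sketched in Python. This is a minimal model, assuming little-endian byte numbering (byte-0 at the lowest address) and a `bytes` object standing in for memory; the helper names `ext_low`, `ext_high`, `load_quad_aligned` and `load_unaligned` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def ext_low(ra, rb, nbytes):
    """EXTxL model: shift right by Rb<2:0> bytes, keep the low nbytes bytes."""
    shift = (rb & 7) * 8
    return ((ra & MASK64) >> shift) & ((1 << (8 * nbytes)) - 1)


def ext_high(ra, rb, nbytes):
    """EXTxH model: shift left by (8 - Rb<2:0>) bytes, keep the low nbytes bytes."""
    shift = (64 - (rb & 7) * 8) & 63  # only bits <5:0> of the shift count are used
    return ((ra << shift) & MASK64) & ((1 << (8 * nbytes)) - 1)


def load_quad_aligned(memory, address):
    """Load the aligned quadword containing `address` (low three address bits ignored)."""
    base = address & ~7
    return int.from_bytes(memory[base:base + 8], "little")


def load_unaligned(memory, address, nbytes):
    """Zero-extended load of an nbytes-wide datum from an arbitrary byte address."""
    low = load_quad_aligned(memory, address)
    high = load_quad_aligned(memory, address + nbytes - 1)
    return ext_low(low, address, nbytes) | ext_high(high, address, nbytes)
```

Two aligned quadword loads plus one low extract, one high extract and an OR recover the datum regardless of where it falls relative to the quadword boundary.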

The insert byte instructions shift bytes from the register
Ra and insert them into a field of zeros, storing the
result in the register Rc; register Rb, bits <2:0>, selects
the shift amount of 0-7 bytes, and the function
code selects the field width of one, two, four or eight
bytes. These insert byte instructions can generate a byte,
word, longword or quadword datum that is placed in
the register(s) at an arbitrary byte alignment.

The byte mask instructions MSKxL and MSKxH set
selected bytes of register Ra to zero, storing the result in
register Rc; register Rb<2:0> selects the starting position
of the field of zero bytes, and the function code
selects the maximum width, one, two, four or eight
bytes. The mask instructions generate a byte, word,
longword or quadword field of zeros that can spread
across two registers at an arbitrary byte alignment.
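The insert and mask operations are typically used together to store a datum at an arbitrary byte address. The Python sketch below models that pairing under stated assumptions: little-endian byte numbering, a `bytearray` standing in for memory, and hypothetical helper names (`byte_zap`, `field_mask`, `ins_low`, `ins_high`, `msk_low`, `msk_high`, `store_unaligned`). The insert helpers follow the byte-mask formulation given in Appendix A4; the mask helpers are an assumed mirror of it.

```python
MASK64 = (1 << 64) - 1


def byte_zap(value, zap_bits):
    """Clear byte i of value wherever bit i of zap_bits is one."""
    for i in range(8):
        if (zap_bits >> i) & 1:
            value &= ~(0xFF << (8 * i))
    return value & MASK64


def field_mask(rb, nbytes):
    """16-bit byte mask: nbytes one-bits shifted left by Rb<2:0>."""
    return ((1 << nbytes) - 1) << (rb & 7)


def ins_low(ra, rb, nbytes):
    temp = (ra << ((rb & 7) * 8)) & MASK64
    return byte_zap(temp, ~field_mask(rb, nbytes) & 0xFF)


def ins_high(ra, rb, nbytes):
    shift = (64 - (rb & 7) * 8) & 63
    temp = (ra & MASK64) >> shift
    return byte_zap(temp, ~(field_mask(rb, nbytes) >> 8) & 0xFF)


def msk_low(ra, rb, nbytes):
    return byte_zap(ra, field_mask(rb, nbytes) & 0xFF)


def msk_high(ra, rb, nbytes):
    return byte_zap(ra, (field_mask(rb, nbytes) >> 8) & 0xFF)


def store_unaligned(memory, address, value, nbytes):
    """Read-modify-write of the one or two aligned quadwords covering the datum."""
    lo_base = address & ~7
    hi_base = (address + nbytes - 1) & ~7
    high = int.from_bytes(memory[hi_base:hi_base + 8], "little")
    low = int.from_bytes(memory[lo_base:lo_base + 8], "little")
    high = msk_high(high, address, nbytes) | ins_high(value, address, nbytes)
    low = msk_low(low, address, nbytes) | ins_low(value, address, nbytes)
    # Store high then low, so the aligned (single quadword) case ends up correct.
    memory[hi_base:hi_base + 8] = high.to_bytes(8, "little")
    memory[lo_base:lo_base + 8] = low.to_bytes(8, "little")
```

The store order matters only in the degenerate case where both quadword addresses coincide: the unchanged copy is written first and the updated copy last.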

The zero bytes instructions ZAP and ZAPNOT set
selected bytes of register Ra to zero, storing the result in
register Rc; register Rb<7:0> selects the bytes to be
zeroed, where bit-0 of Rb corresponds to byte-0, bit-1 of
Rb corresponds to byte-1, etc. A result byte is set to
zero if the corresponding bit of Rb is a one for ZAP and
a zero for ZAPNOT.
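A small Python model of the two zero bytes operations, assuming registers are represented as 64-bit integers; the function names simply reuse the mnemonics and are illustrative only.

```python
def zap(ra, rb):
    """Clear byte i of ra wherever bit i of rb<7:0> is a one."""
    result = 0
    for i in range(8):
        if not (rb >> i) & 1:
            result |= ra & (0xFF << (8 * i))  # byte survives when its bit is zero
    return result


def zapnot(ra, rb):
    """Clear byte i of ra wherever bit i of rb<7:0> is a zero."""
    return zap(ra, ~rb & 0xFF)
```

With this model, `zapnot(x, 0x0F)` keeps only the low longword of `x`, and `zapnot(x, 0x01)` keeps only its low byte.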

In Appendix A, instruction sequences are given to
illustrate how byte operations can be accomplished
using the byte instructions set forth above.

The floating point instructions operate on floating
point operands in each of five data formats: (1) F_floating,
which is VAX single precision; (2) D_floating,
which is VAX double precision with an 8-bit exponent;
(3) G_floating, which is VAX double precision, with
an 11-bit exponent; (4) S_floating, which is IEEE single
precision; and (5) T_floating, which is IEEE double
precision, with an 11-bit exponent. The single precision
values are loaded to the upper 32-bits of the 64-bit registers
61, with the lower 32-bits being zeros. Data conversion
instructions are also provided to convert operands
between floating-point and quadword integer formats,
between single and double floating, and between quadword
and longword integers. There is no global floating-point
processor state for the CPU 10; i.e., the machine
state is not switched between data formats, but
instead the choice of data formats is encoded in each
instruction.

Floating point numbers are represented with three
fields: sign, exponent and fraction. The sign field is one
bit, the exponent field is eight or eleven bits, and the
fraction is 23, 52 or 55 bits. Several different rounding
modes are provided; for VAX formats, rounding is
normal (biased) or chopped, while for IEEE formats
rounding is of four types: normal (unbiased round to
nearest), rounding toward plus infinity, rounding
toward minus infinity, and round toward zero. There
are six exceptions that can be generated by floating
point instructions, all signalled by an arithmetic exception
trap; these exceptions are invalid operation, division
by zero, overflow, underflow, inexact result and
integer overflow.



LDF      Load F_floating
LDD      Load D_floating (Load G_floating)
LDS      Load S_floating (Load Longword Integer)
LDT      Load T_floating (Load Quadword Integer)
STF      Store F_floating
STD      Store D_floating (Store G_floating)
STS      Store S_floating (Store Longword Integer)
STT      Store T_floating (Store Quadword Integer)

Each of the load instructions fetches a floating point
datum of the specified type from memory, reorders the
bytes to conform to the floating point register format
for this type, and writes it to the register Fa in register
set 61, with the virtual address being computed by adding
the register Rb to the sign-extended 16-bit displacement.
The store instructions cause the contents of register
Fa to be stored in the memory location at a virtual
address computed by adding register Rb to the sign-extended
16-bit displacement, with the bytes being reordered
on the way out to conform to the memory format
for this floating point data type.

The floating point branch instructions operate in the
same manner as the integer branch instructions discussed
above, i.e., the value in a floating point register
Fa is tested and the PC is conditionally changed. These
floating point branch instructions include the following:

FBEQ     Floating Branch Equal
FBNE     Floating Branch Not Equal
FBLT     Floating Branch Less Than
FBLE     Floating Branch Less Than or Equal
FBGT     Floating Branch Greater Than
FBGE     Floating Branch Greater Than or Equal

Register Fa is tested, and if the specified relationship is
true, the PC is loaded with the target virtual address;
otherwise, execution continues with the next sequential
instruction. The displacement is treated as a signed
longword offset, meaning it is shifted left two bits to
address a longword boundary, sign-extended to 64-bits,
and added to the updated PC to form the target virtual
address.

The operate format instructions for floating point
arithmetic include add, subtract, multiply, divide, compare,
absolute value, copy and convert operations on
64-bit register values in the register 61. Each instruction
specifies the source and destination formats of the values,
as well as rounding mode and trapping modes to be
used. These floating point operate instructions are listed
in Table B.

The floating point conditional move instructions correspond
to the integer conditional move instructions,
except floating point registers 61 are used instead of the
integer registers 43. As with the integer conditional
move, these instructions can be used to avoid branch
instructions.

The CPU 10 has several "miscellaneous" instructions
in its instruction set, all using the instruction formats
above, but not fitting into the categories discussed thus
far. The following are the miscellaneous instructions:

CALL_PAL  Call Privileged Architecture Library Routine
FETCH     Prefetch Data Block
FETCH_M   Prefetch, Modify Intent
DRAINT    Drain Instruction Pipeline
MB        Memory Barrier
RCC       Read Cycle Counter



The CALL_PAL instruction using format 75 of
FIG. 8 causes a trap to the PAL code (bits <25:0> of
the instruction). This instruction is not issued until all
previous instructions are guaranteed to complete without
exceptions; if an exception occurs for one of these
previous instructions, the continuation PC in the exception
stack frame points to the CALL_PAL instruction.

The FETCH instruction prefetches an aligned 512-byte
block surrounding the virtual address given by the
contents of Rb. This address in Rb is used to designate
an aligned 512-byte block of data. The operation is to
attempt to move all or part of the 512-byte block (or a
larger surrounding block) of data to a faster-access part
of the memory hierarchy, in anticipation of subsequent
Load or Store instructions that access the data. The
FETCH instruction is thus a hint to the CPU 10 that
may allow faster execution. If the construction of the
particular CPU does not implement this technique, then
the hint may be ignored. The FETCH_M instruction
gives an additional hint that modifications (stores) to
some or all of the data are anticipated; this gives faster
operation in some writeback-cache designs because the
data block will be read into the cache as "owned" so
when a write is executed to the data of the block in the
cache it will not generate a fault to go off and claim
ownership. No exceptions are generated by FETCH; if
a Load (or Store in the case of FETCH_M) using the
same address would fault, the prefetch request is ignored.
The FETCH instruction is intended to help software
bury memory latencies on the order of 100 cycles;
it is unlikely to matter (or be implemented) for memory
latencies on the order of 10 cycles, since code scheduling
should be used to bury such short latencies.

The DRAINT instruction stalls instruction issuing
until all prior instructions are guaranteed to complete
without incurring arithmetic traps. This allows software
to guarantee that, in a pipelined implementation,
all previous arithmetic instructions will complete without
incurring any arithmetic traps before any instruction
after the DRAINT is issued. For example, it
should be used before changing an exception handler to
ensure that all exceptions on previous instructions are
processed in the current exception-handling environment.

The memory barrier instruction MB guarantees that
all future loads or stores will not complete until after all
previous loads and stores have completed. In the absence
of an MB instruction, loads and stores to different
physical locations are allowed to complete out of order.
The MB instruction allows memory accesses to be serialized.

The read cycle counter instruction RCC causes the
register Ra to be written with the contents of the CPU
cycle counter. The low order 32-bits of the cycle
counter is an unsigned integer that increments once per
N CPU cycles, where N is an implementation-specific
integer in the range 1-to-16. The counter wraps around
to zero at an implementation-specific value.

While this invention has been described with reference
to specific embodiments, this description is not
meant to be construed in a limiting sense. Various modifications
of the disclosed embodiments, as well as other
embodiments of the invention, will be apparent to persons
skilled in the art upon reference to this description.
It is therefore contemplated that the appended claims
will cover any such modifications or embodiments as
fall within the true scope of the invention.

TABLE A

Page Table Entry
Fields in the page table entry are interpreted as follows:

Bits      Description

<0>       Valid (V) - Indicates the validity of the PFN field.
<1>       Fault On Read (FOR) - When set, a Fault On Read
          exception occurs on an attempt to read any location in
          the page.
<2>       Fault On Write (FOW) - When set, a Fault On Write
          exception occurs on an attempt to write any location in
          the page.
<3>       Fault On Execute (FOE) - When set, a Fault On Execute
          exception occurs on an attempt to execute an instruction
          in the page.
<4>       Address Space Match (ASM) - When set, this PTE
          matches all Address Space Numbers. For a given VA,
          ASM must be set consistently in all processes.
<6:5>     Granularity hint (GH).
<7>       Reserved for future use.
<8>       Kernel Read Enable (KRE) - This bit enables reads
          from kernel mode. If this bit is a 0 and a LOAD or
          instruction fetch is attempted while in kernel mode, an
          Access Violation occurs. This bit is valid even when
          V=0.
<9>       Executive Read Enable (ERE) - This bit enables reads
          from executive mode. If this bit is a 0 and a LOAD or
          instruction fetch is attempted while in executive mode,
          an Access Violation occurs. This bit is valid even when
          V=0.
<10>      Supervisor Read Enable (SRE) - This bit enables reads
          from supervisor mode. If this bit is a 0 and a LOAD or
          instruction fetch is attempted while in supervisor mode,
          an Access Violation occurs. This bit is valid even when
          V=0.
<11>      User Read Enable (URE) - This bit enables reads from
          user mode. If this bit is a 0 and a LOAD or instruction
          fetch is attempted while in user mode, an Access
          Violation occurs. This bit is valid even when V=0.
<12>      Kernel Write Enable (KWE) - This bit enables writes
          from kernel mode. If this bit is a 0 and a STORE is
          attempted while in kernel mode, an Access Violation
          occurs. This bit is valid even when V=0.
<13>      Executive Write Enable (EWE) - This bit enables
          writes from executive mode. If this bit is a 0 and a
          STORE is attempted while in executive mode, an
          Access Violation occurs.
<14>      Supervisor Write Enable (SWE) - This bit enables
          writes from supervisor mode. If this bit is a 0 and a
          STORE is attempted while in supervisor mode, an
          Access Violation occurs.
<15>      User Write Enable (UWE) - This bit enables writes
          from user mode. If this bit is a 0 and a STORE is
          attempted while in user mode, an Access Violation
          occurs.
<31:16>   Reserved for software.
<63:32>   Page Frame Number (PFN) - The PFN field always
          points to a page boundary. If V is set, the PFN is
          concatenated with the Byte Within Page bits of the virtual
          address to obtain the physical address. If V is clear, this
          field may be used by software.



TABLE B

Floating Point Arithmetic Operations

Mnemonic  Bit operation

CPYS      Copy Sign
CPYSN     Copy Sign Negate
CPYSE     Copy Sign and Exponent
CPYSEE    Copy Sign and Extended Exponent
CVTQL     Convert Quadword to Longword
CVTLQ     Convert Longword to Quadword
FCMOV     Floating Conditional Move

Mnemonic  Arithmetic operation

ADDF      Add F_floating
ADDD      Add D_floating
ADDG      Add G_floating
ADDS      Add S_floating
ADDT      Add T_floating
CMPD      Compare D_floating
CMPG      Compare G_floating
CMPS      Compare S_floating
CMPT      Compare T_floating
CVTDQ     Convert D_floating to Quadword
CVTGQ     Convert G_floating to Quadword
CVTSQ     Convert S_floating to Quadword
CVTTQ     Convert T_floating to Quadword
CVTQD     Convert Quadword to D_floating
CVTQF     Convert Quadword to F_floating
CVTQG     Convert Quadword to G_floating
CVTQS     Convert Quadword to S_floating
CVTQT     Convert Quadword to T_floating
CVTFG     Convert F_floating to G_floating
CVTDF     Convert D_floating to F_floating
CVTGF     Convert G_floating to F_floating
CVTST     Convert S_floating to T_floating
CVTTS     Convert T_floating to S_floating
DIVF      Divide F_floating
DIVD      Divide D_floating
DIVG      Divide G_floating
DIVS      Divide S_floating
DIVT      Divide T_floating
MULF      Multiply F_floating
MULD      Multiply D_floating
MULG      Multiply G_floating
MULS      Multiply S_floating
MULT      Multiply T_floating
SUBF      Subtract F_floating
SUBD      Subtract D_floating
SUBG      Subtract G_floating
SUBS      Subtract S_floating
SUBT      Subtract T_floating
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APPENDIX A 
BYTE MANIPULATION 



A1. Software notes for Compare byte CMPBGE instruction:
The result of CMPBGE can be used as an input to ZAP and ZAPNOT.
To scan for a byte of zeros in a character string, do:
        <initialize R1 to aligned QW address of string>

LOOP:
        LDQ     R2,0(R1)      ; Pick up 8 bytes
        LDA     R1,8(R1)      ; Increment string pointer
        CMPBGE  R31,R2,R3     ; If NO bytes of zero, R3<7:0> = 0
        BEQ     R3,LOOP       ; Loop if no terminator byte found
                              ; At this point, R3 can be used to determine
                              ; which byte terminated
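The scan loop above can be mirrored in Python to make the role of the CMPBGE result explicit. This is a sketch only: it assumes little-endian byte numbering, uses a `bytes` object in place of memory, compares against a zero operand as the sequence above does with R31, and introduces the hypothetical helper `find_terminator`.

```python
def cmpbge(ra, rb):
    """Bit i of the result is set if byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        if ((ra >> (8 * i)) & 0xFF) >= ((rb >> (8 * i)) & 0xFF):
            result |= 1 << i
    return result


def find_terminator(memory, start):
    """Return the address of the first zero byte at or after a quadword-aligned start."""
    address = start
    while True:
        quad = int.from_bytes(memory[address:address + 8], "little")  # pick up 8 bytes
        flags = cmpbge(0, quad)  # a bit is set only where the byte is zero
        if flags:
            lowest = (flags & -flags).bit_length() - 1  # index of lowest set bit
            return address + lowest
        address += 8  # increment string pointer
```

The lowest set bit of the flags identifies which byte terminated the string, which is the determination the final comment of the assembly sequence refers to.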



To compare two character strings for greater/less, do:

        <initialize R1 to aligned QW address of string1>
        <initialize R2 to aligned QW address of string2>

LOOP:
        LDQ     R3,0(R1)      ; Pick up 8 bytes of string1
        LDA     R1,8(R1)      ; Increment string1 pointer
        LDQ     R4,0(R2)      ; Pick up 8 bytes of string2
        LDA     R2,8(R2)      ; Increment string2 pointer
        XOR     R3,R4,R5      ; Test for all equal bytes
        BEQ     R5,LOOP       ; Loop if all equal
        CMPBGE  R31,R5,R5     ;
                              ; At this point, R5 can be used to index
                              ; a table lookup of the first not-equal
                              ; byte position



To range-check a string of characters in R1 for '0'..'9', do:

        LDQ     R2,lit0s      ; Pick up 8 bytes of the character BELOW '0'
                              ; '////////'
        LDQ     R3,lit9s      ; Pick up 8 bytes of the character ABOVE '9'
                              ; '::::::::'
        CMPBGE  R2,R1,R4      ; Some R4<i> = 1 if character is LT than '0'
        CMPBGE  R1,R3,R5      ; Some R5<i> = 1 if character is GT than '9'
        BNE     R4,ERROR      ; Branch if some char too low
        BNE     R5,ERROR      ; Branch if some char too high

A2. Software notes for byte extract instructions:

The comments in the examples below assume that (X mod 8) = 5, the value of the
aligned quadword containing X is CBAxxxxx, and the value of the aligned quadword
containing X+7 is yyyHGFED. The examples below are the most general case; if
more information is known about the value or intended alignment of X, shorter
sequences can be used.

The intended sequence for loading a quadword from unaligned address X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = CBAxxxxx
        LDQ_U   R2,X+7        ; Ignores va<2:0>, R2 = yyyHGFED
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTQL   R1,R3,R1      ; R1 = 00000CBA
        EXTQH   R2,R3,R2      ; R2 = HGFED000
        OR      R2,R1,R1      ; R1 = HGFEDCBA



The intended sequence for loading and zero-extending a longword from unaligned
address X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = CBAxxxxx
        LDQ_U   R2,X+3        ; Ignores va<2:0>, R2 = yyyyyyyD
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTLL   R1,R3,R1      ; R1 = 00000CBA
        EXTLH   R2,R3,R2      ; R2 = 0000D000
        OR      R2,R1,R1      ; R1 = 0000DCBA



The intended sequence for loading and sign-extending a longword from unaligned
address X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = CBAxxxxx
        LDQ_U   R2,X+3        ; Ignores va<2:0>, R2 = yyyyyyyD
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTLL   R1,R3,R1      ; R1 = 00000CBA
        EXTLH   R2,R3,R2      ; R2 = 0000D000
        OR      R2,R1,R1      ; R1 = 0000DCBA
        SLL     R1,#32,R1     ; R1 = DCBA0000
        SRA     R1,#32,R1     ; R1 = ssssDCBA



The intended sequence for loading and zero-extending a word from unaligned address
X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = yBAxxxxx
        LDQ_U   R2,X+1        ; Ignores va<2:0>, R2 = yBAxxxxx
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTWL   R1,R3,R1      ; R1 = 000000BA
        EXTWH   R2,R3,R2      ; R2 = 00000000
        OR      R2,R1,R1      ; R1 = 000000BA



The intended sequence for loading and sign-extending a word from unaligned address
X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = yBAxxxxx
        LDQ_U   R2,X+1        ; Ignores va<2:0>, R2 = yBAxxxxx
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTWL   R1,R3,R1      ; R1 = 000000BA
        EXTWH   R2,R3,R2      ; R2 = 00000000
        OR      R2,R1,R1      ; R1 = 000000BA
        SLL     R1,#48,R1     ; R1 = BA000000
        SRA     R1,#48,R1     ; R1 = ssssssBA





The intended sequence for loading and zero-extending a byte from address X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = yyAxxxxx
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTBL   R1,R3,R1      ; R1 = 0000000A



The intended sequence for loading and sign-extending a byte from address X is:

        LDQ_U   R1,X          ; Ignores va<2:0>, R1 = yyAxxxxx
        LDA     R3,X          ; R3<2:0> = (X mod 8) = 5
        EXTBL   R1,R3,R1      ; R1 = 0000000A
        SLL     R1,#56,R1     ; R1 = A0000000
        SRA     R1,#56,R1     ; R1 = sssssssA



Optimized examples: 

Assume that a word fetch is needed from 10(R3), where R3 is intended to contain a 
longword-aligned address. The optimized sequences below take advantage of the 
known constant offset, and the longword alignment (hence a single aligned longword 
contains the entire word). The sequences generate a Data Alignment Fault if R3 
does not contain a longword-aligned address. 



The intended sequence for loading and zero-extending an aligned word from 10(R3)
is:

        LDL     R1,8(R3)      ; R1 = ssssBAxx
                              ; Faults if R3 is not longword aligned
        EXTWL   R1,#2,R1      ; R1 = 000000BA

The intended sequence for loading and sign-extending an aligned word from 10(R3)
is:

        LDL     R1,8(R3)      ; R1 = ssssBAxx
                              ; Faults if R3 is not longword aligned
        SRA     R1,#16,R1     ; R1 = ssssssBA



A3. Software notes for byte mask instructions:

The comments in the examples below assume that (X mod 8) = 5, the value of the
aligned quadword containing X is CBAxxxxx, the value of the aligned quadword
containing X+7 is yyyHGFED, and the value to be stored from R5 is hgfedcba. The
examples below are the most general case; if more information is known about the
value or intended alignment of X, shorter sequences can be used.

The intended sequence for storing an unaligned quadword R5 at address X is:

        LDA     R6,X          ! R6<2:0> = (X mod 8) = 5
        LDQ_U   R2,X+7        ! Ignores va<2:0>, R2 = yyyHGFED
        LDQ_U   R1,X          ! Ignores va<2:0>, R1 = CBAxxxxx
        INSQH   R5,R6,R4      ! R4 = 000hgfed
        INSQL   R5,R6,R3      ! R3 = cba00000
        MSKQH   R2,R6,R2      ! R2 = yyy00000
        MSKQL   R1,R6,R1      ! R1 = 000xxxxx
        OR      R2,R4,R2      ! R2 = yyyhgfed
        OR      R1,R3,R1      ! R1 = cbaxxxxx
        STQ_U   R2,X+7        ! Must store high then low for
        STQ_U   R1,X          ! degenerate case of aligned QW

The intended sequence for storing an unaligned longword R5 at X is:

        LDA     R6,X          ! R6<2:0> = (X mod 8) = 5
        LDQ_U   R2,X+3        ! Ignores va<2:0>, R2 = yyyyyyyD
        LDQ_U   R1,X          ! Ignores va<2:0>, R1 = CBAxxxxx
        INSLH   R5,R6,R4      ! R4 = 0000000d
        INSLL   R5,R6,R3      ! R3 = cba00000
        MSKLH   R2,R6,R2      ! R2 = yyyyyyy0
        MSKLL   R1,R6,R1      ! R1 = 000xxxxx
        OR      R2,R4,R2      ! R2 = yyyyyyyd
        OR      R1,R3,R1      ! R1 = cbaxxxxx
        STQ_U   R2,X+3        ! Must store high then low for
        STQ_U   R1,X          ! degenerate case of aligned



The intended sequence for storing an unaligned word R5 at X is:

        LDA     R6,X          ! R6<2:0> = (X mod 8) = 5
        LDQ_U   R2,X+1        ! Ignores va<2:0>, R2 = yBAxxxxx
        LDQ_U   R1,X          ! Ignores va<2:0>, R1 = yBAxxxxx
        INSWH   R5,R6,R4      ! R4 = 00000000
        INSWL   R5,R6,R3      ! R3 = 0ba00000
        MSKWH   R2,R6,R2      ! R2 = yBAxxxxx
        MSKWL   R1,R6,R1      ! R1 = y00xxxxx
        OR      R2,R4,R2      ! R2 = yBAxxxxx
        OR      R1,R3,R1      ! R1 = ybaxxxxx
        STQ_U   R2,X+1        ! Must store high then low for
        STQ_U   R1,X          ! degenerate case of aligned



The intended sequence for storing a byte R5 at X is:

        LDA     R6,X          ! R6<2:0> = (X mod 8) = 5
        LDQ_U   R1,X          ! Ignores va<2:0>, R1 = yyAxxxxx
        INSBL   R5,R6,R3      ! R3 = 00a00000
        MSKBL   R1,R6,R1      ! R1 = yy0xxxxx
        OR      R1,R3,R1      ! R1 = yyaxxxxx
        STQ_U   R1,X

A4. Additional Detail of Byte Insert instruction:

The Byte Insert instructions perform the following operation:

CASE opcode BEGIN
    INSBL: byte_mask <- 00000001 (bin)
    INSWx: byte_mask <- 00000011 (bin)
    INSLx: byte_mask <- 00001111 (bin)
    INSQx: byte_mask <- 11111111 (bin)
ENDCASE

byte_mask <- LEFT_SHIFT(byte_mask, Rbv<2:0>)

CASE opcode BEGIN
    INSxL:
        byte_loc <- Rbv<2:0>*8
        temp <- LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc <- BYTE_ZAP(temp, NOT(byte_mask<7:0>))
    INSxH:
        byte_loc <- 64 - Rbv<2:0>*8
        temp <- RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc <- BYTE_ZAP(temp, NOT(byte_mask<15:8>))
ENDCASE

A5. Additional Detail of Byte Extract instruction:

The Byte Extract instructions perform the following operation:

CASE opcode BEGIN
    EXTBL: byte_mask <- 00000001 (bin)
    EXTWx: byte_mask <- 00000011 (bin)
    EXTLx: byte_mask <- 00001111 (bin)
    EXTQx: byte_mask <- 11111111 (bin)
ENDCASE

CASE opcode BEGIN
    EXTxL:
        byte_loc <- Rbv<2:0>*8
        temp <- RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc <- BYTE_ZAP(temp, NOT(byte_mask))
    EXTxH:
        byte_loc <- 64 - Rbv<2:0>*8
        temp <- LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc <- BYTE_ZAP(temp, NOT(byte_mask))
ENDCASE

A6. Atomic Byte Write:

An atomic byte write operation is accomplished by the following instruction
sequence:

        LDA     R6,X          ; Load address to R6 from memory loc. X
        BIC     R6,#7,R7      ; R6 BIC using literal #7, result to R7
retry:  LDQ_L   R1,0(R7)      ; Load Locked from R7 address
        INSBL   R5,R6,R3      ; Insert Byte
        MSKBL   R1,R6,R1      ; Mask Byte
        OR      R1,R3,R1      ;
        STQ_C   R1,0(R7)      ; Store conditional to same location
        BNE     R1,retry
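The retry structure of this sequence can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not a model of the hardware: the `Memory` class, its single `locked_address` field standing in for the processor's lock indication, the boolean returned by `store_conditional`, and the `interfere` callback used to inject a competing store are all hypothetical, and the polarity of the value the real store conditional writes back to its register is deliberately left out.

```python
class Memory:
    """Toy memory with a single lock indication, as seen by one processor."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.locked_address = None  # set by load_locked, cleared by any store

    def load_locked(self, address):
        self.locked_address = address
        return int.from_bytes(self.data[address:address + 8], "little")

    def store(self, address, value):
        """An ordinary quadword store, e.g. from another processor on the bus."""
        self.data[address:address + 8] = value.to_bytes(8, "little")
        if self.locked_address == address:
            self.locked_address = None  # the indication records the other store

    def store_conditional(self, address, value):
        """Store only if no other store hit the locked quadword; report success."""
        if self.locked_address != address:
            return False
        self.data[address:address + 8] = value.to_bytes(8, "little")
        self.locked_address = None
        return True


def atomic_byte_write(memory, address, byte, interfere=None):
    """Write one byte atomically; returns the number of attempts needed."""
    base = address & ~7          # aligned quadword address (the BIC step)
    shift = (address & 7) * 8    # byte position within the quadword
    attempts = 0
    while True:
        attempts += 1
        quad = memory.load_locked(base)
        if interfere is not None:
            interfere(attempts)  # test hook: lets another agent store in between
        quad = (quad & ~(0xFF << shift)) | ((byte & 0xFF) << shift)  # mask, then insert
        if memory.store_conditional(base, quad):
            return attempts
```

If a competing store lands between the locked load and the conditional store, the conditional store is abandoned and the whole read-modify-write is repeated on fresh data, so the competing update is never overwritten with stale bytes.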

What is claimed is: temal memory being coupled to said processor by a bus, 

1. A method of performing an atomic byte write in an comprising the steps of: 

operation of a processor, said atomic byte write being loading a selected aUgned estemal memory location 
performed in a memory by way of a system bus external " including said selected non-aUgned byte location 
to said processor, said system bus connecting said pro- via said bus to intwnal register means m said pro- 
cessor to said memory, comprising the steps of: cessor. and thereafter drtectmg by said processor 
loading a content of a selected aligned multibyte store via said bus to said selected aligned 

external memory location from said memory to „ external memory location; 

internal register means in said processor, said con- " P^vidmg an mdicauon m said P'^^^J^J^y 

tent includtag a non-aligned byte location to be ««« » ««««« '° 

written and thereafter detecting on said system bus .^^^^^^ j"^, ^ne selected byte in said internal 

any other store to said selected aligned-multibyte «P » ^ ^^^^ ^ ^ ^^^^^ 

external memory location; 2 J provide a modified content; and thereafter 

providmg m said processor an mdicauon if any other conditionaUy storing said modified content from said 

store is made to said selected aligned-multibyte .^^^^ ^^^^^^ ^ processor to said 

external memory location; selected aligned external memory location via said 

replacing in said internal register means a value in depending upon whetiier said indication U 

said non-aligned byte location with a new value to 30 present. 

be written, while leaving remainder of said content jj. A method according to claim 11 wherein said 

intact; and thereafter steps of loading and providing said indication are per- 

conditionally storing by said processor said content fonned by one instruction executed by said processor, 

of said internal register means to said selected gj,^ sgjd jtep of conditionally storing is performed by 

aligned multibyte external memory location, de- J5 another one instruction executed by said processor, 

pending upon whether said indication is present. 13, a method according to claim 11 wherein said 

2. A method according to claim 1 including the step selected aligned-multibyte memory location is con- 
of providing an indicator in said processor of whether tained in a memory and said memory is aligned on long- 
er not said step of conditionally storing executed a word (4-byte) boundaries. 

store. *° 14. A method according to claim 1 1 wherein said step 

3. A method according to claim 2 wherein said steps
of conditionally storing and providing said indicator are
performed by one instruction executed by said processor.

4. A method according to claim 2 including, after said
step of conditionally storing, the step of executing by
said processor a branch which is conditional upon said
indicator.

5. A method according to claim 1 wherein said step of
providing an indication resets a status bit, said status bit
being set upon said step of loading.

6. A method according to claim 5 wherein said status
[illegible]

7. A method according to claim 1 wherein said selected
aligned-multibyte memory location is contained
in a memory and said memory is aligned on quadword
(8-byte) boundaries.

8. A method according to claim 1 wherein said step
of loading and providing said indication are performed
by one instruction executed by said processor.

9. A method according to claim 1 wherein said selected
aligned-multibyte memory location is contained
in a memory and said memory is aligned on longword
(4-byte) boundaries.

10. A method according to claim 1 wherein said processor
is a single-chip integrated circuit device.

11. A method of writing by a processor to a selected
non-aligned byte location in external memory, the ex-

of providing an indication resets a status bit in the processor,
said status bit being set upon said step of loading.

15. A method according to claim 11 wherein said
processor is a single-chip integrated circuit device.

16. A method according to claim 11 including the step
of providing an indicator in said processor if said
step of conditionally storing finds said indication to be
present.

17. A method according to claim 16 including, after
said step of conditionally storing, the step of executing
by said processor a branch which is conditional upon
said indicator.

18. A method according to claim 11 wherein said
[illegible] aligned-multibyte [illegible] boundaries.

19. A computer system having a capability of performing
an atomic byte write, comprising:

a CPU and a shared memory, said CPU accessing said
shared memory by a system bus; said memory having
an access path for aligned memory locations of
multi-byte width;

means for generating in said CPU a read request for a
non-aligned data item and for sending to said memory
via said bus an aligned read request for a content
of a location in said memory which includes
said non-aligned data item, said non-aligned data
item being of at least one byte in width;

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register means in said CPU for receiving said content 
from said memory via said bus and storing said
content; 

means for detecting in said CPU another request on
said bus for said location and for providing a recorded
indication of detection of said another request;

means in said CPU for altering only said data item in 
said register means to produce an altered content;

means in said CPU for generating a write request for
storing said altered content in said location in said 
memory, and for conditionally sending said write 
request to said memory via said bus depending 
upon said recorded indication.

20. A system according to claim 19 including means
in said CPU for indicating whether said recorded indi- 
cation was present when conditionally sending said 
write request to said memory, and further including 
means for executing a conditional branch after said
conditional sending, said conditional branch being responsive
to said means for indicating.

21. A system according to claim 20 wherein said 
means for indicating includes a register of a register set 
accessible within said CPU, and wherein said register 
means is another register of said register set. 

22. A system according to claim 21 wherein said 
means for generating and said means for detecting employ
a single instruction executed by said CPU, and
wherein said means for generating a write request and 
conditionally sending and said means for indicating 
employ another single instruction executed by said 
CPU. 

23. A system according to claim 22 wherein said 
memory is aligned on quadword (8-byte) boundaries. 

24. A system according to claim 22 wherein said CPU
is a single-chip integrated circuit device. 
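
For illustration only, the sequence recited in claims 19 and 20 (aligned read with a recorded indication, alteration of one byte in a register, conditional write, and a branch on the resulting indicator) might be modeled in software roughly as follows. This is a minimal sketch and not the claimed hardware: the names Cpu, load_locked, store_conditional, other_write and atomic_byte_write, the four-quadword memory array, and the little-endian byte-lane numbering are all assumptions introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a small shared memory aligned on quadword (8-byte)
   boundaries, as in claims 7 and 23. */
#define MEM_QUADS 4
static uint64_t memory[MEM_QUADS];

/* Per-CPU state standing in for the "recorded indication" (a status bit). */
typedef struct {
    bool     lock_flag;    /* set by load_locked, cleared by an intervening write */
    uint64_t locked_addr;  /* aligned byte address being watched */
} Cpu;

/* Load-locked: one operation performs the aligned read and sets the status bit. */
static uint64_t load_locked(Cpu *cpu, uint64_t byte_addr) {
    uint64_t aligned = byte_addr & ~(uint64_t)7;
    cpu->lock_flag = true;
    cpu->locked_addr = aligned;
    return memory[aligned / 8];
}

/* Models another bus agent writing the location; the observing CPU's
   detecting means clears its status bit when the addresses match. */
static void other_write(Cpu *observer, uint64_t byte_addr, uint64_t value) {
    uint64_t aligned = byte_addr & ~(uint64_t)7;
    memory[aligned / 8] = value;
    if (observer->lock_flag && observer->locked_addr == aligned)
        observer->lock_flag = false;
}

/* Store-conditional: writes only if the status bit is still set, and
   returns the indicator that a later conditional branch can test. */
static bool store_conditional(Cpu *cpu, uint64_t byte_addr, uint64_t value) {
    uint64_t aligned = byte_addr & ~(uint64_t)7;
    bool ok = cpu->lock_flag && cpu->locked_addr == aligned;
    if (ok)
        memory[aligned / 8] = value;
    cpu->lock_flag = false;
    return ok;
}

/* Atomic byte write: alter only the addressed byte in the register copy,
   then retry the whole sequence until the conditional store succeeds.
   Returns the number of attempts made. */
static int atomic_byte_write(Cpu *cpu, uint64_t byte_addr, uint8_t byte) {
    unsigned shift = (unsigned)(byte_addr & 7) * 8; /* little-endian lane (assumption) */
    int attempts = 0;
    bool ok;
    do {
        uint64_t q = load_locked(cpu, byte_addr);
        q = (q & ~((uint64_t)0xFF << shift)) | ((uint64_t)byte << shift);
        ok = store_conditional(cpu, byte_addr, q);
        attempts++;
    } while (!ok); /* the conditional branch of claim 20 */
    return attempts;
}
```

In this model a byte write to address 11 touches only bits 24 to 31 of the quadword at address 8; if other_write is called between load_locked and store_conditional, the store is suppressed and the caller is expected to repeat the sequence.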
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